<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <id>https://dmer-group.example.com/news</id>
    <title>DMER Group Blog</title>
    <updated>2026-01-10T00:00:00.000Z</updated>
    <generator>https://github.com/jpmonette/feed</generator>
    <link rel="alternate" href="https://dmer-group.example.com/news"/>
    <subtitle>DMER Group Blog</subtitle>
    <icon>https://dmer-group.example.com/img/favicon.ico</icon>
    <entry>
        <title type="html"><![CDATA[MR-KG: A Knowledge Graph of Mendelian Randomization Evidence Powered by Large Language Models]]></title>
        <id>https://dmer-group.example.com/news/2026/01/10/mrknowledgegraph</id>
        <link href="https://dmer-group.example.com/news/2026/01/10/mrknowledgegraph"/>
        <updated>2026-01-10T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[We describe MR-KG, a new resource that uses large language models to automate extraction of Mendelian randomization evidence and organises it into a structured knowledge graph.]]></summary>
        <content type="html"><![CDATA[<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-background">📌 Background<a href="https://dmer-group.example.com/news/2026/01/10/mrknowledgegraph#-background" class="hash-link" aria-label="Direct link to 📌 Background" title="Direct link to 📌 Background" translate="no">​</a></h2>
<p>Mendelian randomization (MR) is a powerful causal inference method that uses genetic variants as natural experiments to assess causal relationships between putative risk factors and disease outcomes. MR studies are increasingly abundant, but synthesising evidence across them remains challenging due to heterogeneity in reporting, traits examined, and the structure of the published literature.</p>
<p>To address this, <em>Liu, Burton, Gatua, Hemani &amp; Gaunt (2025)</em> introduce <strong><a href="https://epigraphdb.org/mr-kg/" target="_blank" rel="noopener noreferrer" class="">MR-KG</a></strong> — a knowledge graph of MR evidence automatically extracted from published studies using <strong>large language models (LLMs)</strong>.</p>
<p><a href="https://doi.org/10.64898/2025.12.14.25342218" target="_blank" rel="noopener noreferrer" class="">Liu et al. "MR-KG: A knowledge graph of Mendelian randomization evidence powered by large language models". 2025, medRxiv DOI:10.64898/2025.12.14.25342218</a></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-what-is-mr-kg">🧠 What Is MR-KG?<a href="https://dmer-group.example.com/news/2026/01/10/mrknowledgegraph#-what-is-mr-kg" class="hash-link" aria-label="Direct link to 🧠 What Is MR-KG?" title="Direct link to 🧠 What Is MR-KG?" translate="no">​</a></h2>
<p><img decoding="async" loading="lazy" alt="mr-kg-logo" src="https://dmer-group.example.com/assets/images/mr-kg-logo-9cf943db64f84f00fb8b8c843e3ba343.png" width="1200" height="800" class="img_ev3q"></p>
<p><strong>MR-KG</strong> is a structured, machine-readable network of results from Mendelian randomization studies. Instead of manually curating every causal estimate, this project uses state-of-the-art large language models to:</p>
<ul>
<li class=""><strong>Extract structured data</strong> (e.g., exposures, outcomes, effect estimates) from scientific text</li>
<li class=""><strong>Link entities</strong> such as traits, genetic instruments, and study metadata</li>
<li class=""><strong>Standardise the relationships</strong> so they can be interrogated at scale</li>
</ul>
<p>This makes MR evidence navigable by computational systems for downstream analysis, search, and reasoning.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-how-it-works">🛠️ How It Works<a href="https://dmer-group.example.com/news/2026/01/10/mrknowledgegraph#%EF%B8%8F-how-it-works" class="hash-link" aria-label="Direct link to 🛠️ How It Works" title="Direct link to 🛠️ How It Works" translate="no">​</a></h2>
<p>The MR-KG pipeline has three major components:</p>
<ol>
<li class=""><strong>LLM-based Extraction</strong>: abstracts of published MR studies are processed by large language models to pull out structured triples (e.g., <em>[exposure → outcome → causal effect]</em>).</li>
<li class=""><strong>Graph Construction and Storage</strong>: extracted results are normalised into a consistent schema and stored as a graph in which nodes represent entities (traits, studies, variants) and edges represent relationships (e.g., causal evidence).</li>
<li class=""><strong>Interactive Access</strong>: a live web interface and API (e.g., via <em><a href="https://epigraphdb.org/mr-kg" target="_blank" rel="noopener noreferrer" class="">https://epigraphdb.org/mr-kg</a></em>) allow users and programs to query and explore the graph.</li>
</ol>
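<p>As a rough illustration of the second step, the Python sketch below groups extracted records by exposure and outcome and flags pairs where studies disagree on the direction of effect. The record fields (<code>study</code>, <code>exposure</code>, <code>outcome</code>, <code>direction</code>), the placeholder trait names and both helper functions are assumptions made for this example; they are not the actual MR-KG schema, API or data.</p>
<pre>
```python
from collections import defaultdict

# Toy records standing in for LLM-extracted results. All names and values
# are placeholders for illustration, not MR-KG data or its real schema.
records = [
    {"study": "study-1", "exposure": "trait-A", "outcome": "trait-B", "direction": "positive"},
    {"study": "study-2", "exposure": "trait-A", "outcome": "trait-B", "direction": "negative"},
    {"study": "study-3", "exposure": "trait-A", "outcome": "trait-C", "direction": "positive"},
]


def build_graph(records):
    """Group evidence edges by (exposure, outcome) node pair."""
    graph = defaultdict(list)
    for rec in records:
        pair = (rec["exposure"], rec["outcome"])
        graph[pair].append({"study": rec["study"], "direction": rec["direction"]})
    return graph


def conflicting_pairs(graph):
    """Return pairs for which studies report more than one direction."""
    return sorted(
        pair
        for pair, edges in graph.items()
        if len({edge["direction"] for edge in edges}) != 1
    )


graph = build_graph(records)
print(len(graph[("trait-A", "trait-B")]))  # number of studies on this pair
print(conflicting_pairs(graph))            # pairs with disagreeing directions
```
</pre>
<p>Keying edges by node pair keeps the "all evidence for this exposure and outcome" lookup cheap. A production graph database would expose the same idea through its own query language.</p>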
<p>The project’s code repositories also include supplementary tools for quality control and for similarity analyses between studies or traits.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-key-features">🔍 Key Features<a href="https://dmer-group.example.com/news/2026/01/10/mrknowledgegraph#-key-features" class="hash-link" aria-label="Direct link to 🔍 Key Features" title="Direct link to 🔍 Key Features" translate="no">​</a></h2>
<ul>
<li class=""><strong>Automated MR evidence extraction</strong> reduces the need for manual curation.</li>
<li class=""><strong>Knowledge graph format</strong> represents complex relationships and enables sophisticated queries.</li>
<li class=""><strong>LLM-powered processing</strong> allows extraction from a wide range of publication styles and formats.</li>
<li class=""><strong>API &amp; web frontend</strong> for interactive use by researchers and software clients.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-why-it-matters">💡 Why It Matters<a href="https://dmer-group.example.com/news/2026/01/10/mrknowledgegraph#-why-it-matters" class="hash-link" aria-label="Direct link to 💡 Why It Matters" title="Direct link to 💡 Why It Matters" translate="no">​</a></h2>
<p>MR-KG addresses a key bottleneck in genetic epidemiology: while MR generates important causal insights, the pace and volume of publication make it hard to synthesise evidence consistently.</p>
<p>A structured knowledge graph enables researchers to:</p>
<ul>
<li class=""><strong>Rapidly identify all MR evidence</strong> linking specific exposures and outcomes.</li>
<li class=""><strong>Detect overlapping or conflicting causal findings</strong>.</li>
<li class=""><strong>Integrate MR evidence with other biomedical resources</strong> for multidimensional analysis.</li>
</ul>
<p>This could accelerate causally informed hypothesis generation and help triangulate evidence across studies.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-getting-started">🚀 Getting Started<a href="https://dmer-group.example.com/news/2026/01/10/mrknowledgegraph#-getting-started" class="hash-link" aria-label="Direct link to 🚀 Getting Started" title="Direct link to 🚀 Getting Started" translate="no">​</a></h2>
<p>The MR-KG project’s web interface and API are publicly available at <a href="https://epigraphdb.org/mr-kg/" target="_blank" rel="noopener noreferrer" class="">https://epigraphdb.org/mr-kg/</a>, allowing other teams to:</p>
<ul>
<li class="">Explore the current graph and extracted evidence</li>
<li class="">Build analytical tools on top of the graph</li>
<li class="">Contribute improvements to extraction models or schema</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-a-note-on-preprints">🧪 A Note on Preprints<a href="https://dmer-group.example.com/news/2026/01/10/mrknowledgegraph#-a-note-on-preprints" class="hash-link" aria-label="Direct link to 🧪 A Note on Preprints" title="Direct link to 🧪 A Note on Preprints" translate="no">​</a></h2>
<p>This work is currently a <strong>preprint</strong>, meaning it has not yet been peer-reviewed. Preprints should be interpreted as early reports of research findings, and while valuable for rapid dissemination, they are preliminary.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="links">Links<a href="https://dmer-group.example.com/news/2026/01/10/mrknowledgegraph#links" class="hash-link" aria-label="Direct link to Links" title="Direct link to Links" translate="no">​</a></h2>
<ul>
<li class="">Preprint: <a href="https://doi.org/10.64898/2025.12.14.25342218" target="_blank" rel="noopener noreferrer" class="">https://doi.org/10.64898/2025.12.14.25342218</a></li>
<li class="">Project code repositories:
<ul>
<li class="">Data extraction pipeline <a href="https://github.com/MRCIEU/llm-data-extraction" target="_blank" rel="noopener noreferrer" class="">https://github.com/MRCIEU/llm-data-extraction</a></li>
<li class="">Knowledge graph construction and interface <a href="https://github.com/MRCIEU/mr-kg" target="_blank" rel="noopener noreferrer" class="">https://github.com/MRCIEU/mr-kg</a></li>
</ul>
</li>
</ul>]]></content>
        <author>
            <name>Yi Liu</name>
        </author>
        <author>
            <name>Tom Gaunt</name>
        </author>
        <author>
            <name>with initial draft by AI</name>
        </author>
        <category label="knowledge graphs" term="knowledge graphs"/>
        <category label="Mendelian randomization" term="Mendelian randomization"/>
        <category label="large language models" term="large language models"/>
        <category label="epigraphdb" term="epigraphdb"/>
        <category label="database" term="database"/>
        <category label="NLP" term="NLP"/>
        <category label="preprint" term="preprint"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Genetics as a side‑effect detective for antipsychotic medicines]]></title>
        <id>https://dmer-group.example.com/news/genetic-side-effects-antipsychotics</id>
        <link href="https://dmer-group.example.com/news/genetic-side-effects-antipsychotics"/>
        <updated>2025-07-30T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[We combined drug–receptor binding data with Mendelian randomization and colocalisation to tease apart which antipsychotic side‑effects are likely causal, and whether they’re driven by on‑target or off‑target biology.]]></summary>
        <content type="html"><![CDATA[<img src="https://dmer-group.example.com/img/side_effect_score_pipeline.png" alt="Schematic of the genetics + pharmacology pipeline used to infer drug side-effect mechanisms">
<p>Side‑effects are one of the main reasons people stop taking antipsychotic medicines — even when the drugs are helping with symptoms. But when someone reports “I’ve gained weight” or “my blood pressure has changed”, it’s often hard to know whether the drug truly caused it, <em>which biological target is responsible</em>, and whether that target is the one we wanted to hit in the first place.</p>
<p>In work led by Andrew Elmore, published in <strong><a href="https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1011793" target="_blank" rel="noopener noreferrer" class="">PLOS Genetics</a></strong>, we <strong>combine pharmacology (what receptors a drug binds) with human genetics (natural experiments) to map side‑effects back to specific receptors</strong>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-basic-idea">The basic idea<a href="https://dmer-group.example.com/news/genetic-side-effects-antipsychotics#the-basic-idea" class="hash-link" aria-label="Direct link to The basic idea" title="Direct link to The basic idea" translate="no">​</a></h2>
<p>Antipsychotics don’t just bind a single receptor. They bind <em>many</em> — and some of those bindings are “on‑target” (part of how the drug works), while others are “off‑target” (biological collateral damage).</p>
<p>We built a framework that brings together:</p>
<ul>
<li class=""><strong>Drug–receptor binding affinities</strong> (how strongly each drug binds each receptor)</li>
<li class=""><strong>Reported side‑effects</strong> from a large reference database</li>
<li class=""><strong>Genetic instruments</strong> for receptor/gene activity (eQTLs)</li>
<li class=""><strong>GWAS traits</strong> that can stand in for side‑effects</li>
<li class=""><strong>Mendelian randomization (MR)</strong> + <strong>genetic colocalisation</strong> to strengthen causal interpretation</li>
</ul>
<p>The output is a simple summary: a <strong>side‑effect score</strong> for each <em>Drug × Receptor × Trait</em> combination, which we can then aggregate to compare drugs and mechanisms.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-we-analysed">What we analysed<a href="https://dmer-group.example.com/news/genetic-side-effects-antipsychotics#what-we-analysed" class="hash-link" aria-label="Direct link to What we analysed" title="Direct link to What we analysed" translate="no">​</a></h2>
<p>We focused on <strong>six commonly prescribed antipsychotics</strong> (including clozapine, olanzapine and risperidone). Across these drugs we identified <strong>68 receptors</strong> with evidence of binding, and started from <strong>165 reported side‑effects</strong> — of which <strong>80 could be genetically proxied</strong> using available GWAS in OpenGWAS.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-we-found">What we found<a href="https://dmer-group.example.com/news/genetic-side-effects-antipsychotics#what-we-found" class="hash-link" aria-label="Direct link to What we found" title="Direct link to What we found" translate="no">​</a></h2>
<p>A few results stood out:</p>
<ul>
<li class="">We identified <strong>36 side‑effects</strong> that look likely to be caused by drug action through <strong>30 receptors</strong>.</li>
<li class="">The bulk of evidence pointed to <strong>off‑target mechanisms</strong>.</li>
<li class=""><strong>Clozapine</strong> showed the <strong>largest cumulative side‑effect profile</strong> and the <strong>largest number of scored side‑effects</strong> in this framework.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="three-concrete-examples">Three concrete examples<a href="https://dmer-group.example.com/news/genetic-side-effects-antipsychotics#three-concrete-examples" class="hash-link" aria-label="Direct link to Three concrete examples" title="Direct link to Three concrete examples" translate="no">​</a></h2>
<p>The paper walks through three side‑effects in detail, which illustrate how the approach can generate mechanistic hypotheses.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-neutropenia-dangerously-low-white-blood-cell-counts">1) Neutropenia (dangerously low white blood cell counts)<a href="https://dmer-group.example.com/news/genetic-side-effects-antipsychotics#1-neutropenia-dangerously-low-white-blood-cell-counts" class="hash-link" aria-label="Direct link to 1) Neutropenia (dangerously low white blood cell counts)" title="Direct link to 1) Neutropenia (dangerously low white blood cell counts)" translate="no">​</a></h3>
<p>Clozapine is clinically linked to rare but serious neutropenia. In our genetic scoring, the signal for neutropenia was strongest for clozapine and suggested contributions from targets including <strong>GABRA1</strong> and <strong>HTR1B</strong>.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-weight-gain">2) Weight gain<a href="https://dmer-group.example.com/news/genetic-side-effects-antipsychotics#2-weight-gain" class="hash-link" aria-label="Direct link to 2) Weight gain" title="Direct link to 2) Weight gain" translate="no">​</a></h3>
<p>Weight gain is a major concern for patients and a common reason for discontinuation. Our results again highlighted clozapine (and olanzapine) and suggested that differences in binding to targets such as <strong>CHRM3</strong> and <strong>HRH1</strong> could help explain why these drugs tend to have larger weight‑gain profiles than others.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-blood-pressure-effects">3) Blood pressure effects<a href="https://dmer-group.example.com/news/genetic-side-effects-antipsychotics#3-blood-pressure-effects" class="hash-link" aria-label="Direct link to 3) Blood pressure effects" title="Direct link to 3) Blood pressure effects" translate="no">​</a></h3>
<p>Clinical evidence on antipsychotics and blood pressure can be mixed, and we saw that different receptors imply different directions of effect. In the scoring, blood‑pressure signals were strongly influenced by <strong>HRH1</strong>, consistent with this receptor being a plausible driver for some blood‑pressure changes.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-matters">Why this matters<a href="https://dmer-group.example.com/news/genetic-side-effects-antipsychotics#why-this-matters" class="hash-link" aria-label="Direct link to Why this matters" title="Direct link to Why this matters" translate="no">​</a></h2>
<p>We think the main strength of this work is its <em>transferability</em>:</p>
<ul>
<li class="">It provides a <strong>framework to triage side‑effects early</strong> (even before large trials, when we have target information and genetics).</li>
<li class="">It can help separate “likely causal” side‑effects from those that may be coincidental, comorbidity‑driven, or reporting artefacts.</li>
<li class="">It gives <strong>mechanistic handles</strong>: if a side‑effect seems driven by an off‑target receptor, that receptor becomes a candidate to avoid in future drug design.</li>
</ul>
<p>This won’t replace clinical pharmacovigilance. But it could help researchers ask better questions — and focus lab, trial, and monitoring efforts where the biology is most convincing.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="caveats-we-should-keep-in-mind">Caveats we should keep in mind<a href="https://dmer-group.example.com/news/genetic-side-effects-antipsychotics#caveats-we-should-keep-in-mind" class="hash-link" aria-label="Direct link to Caveats we should keep in mind" title="Direct link to Caveats we should keep in mind" translate="no">​</a></h2>
<p>A few limitations are worth emphasising:</p>
<ul>
<li class="">Some receptor families (notably parts of the <strong>GABA</strong> system) have incomplete binding‑affinity data, which affects how confidently we can compare magnitudes across drugs.</li>
<li class="">The genetic resources used are <strong>predominantly European‑ancestry</strong>, so we should be careful about generalising across populations.</li>
<li class="">Pharmacokinetics (dose, tissue penetration, blood–brain barrier) and real‑world reporting can complicate “genetics → drug” translation.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="links">Links<a href="https://dmer-group.example.com/news/genetic-side-effects-antipsychotics#links" class="hash-link" aria-label="Direct link to Links" title="Direct link to Links" translate="no">​</a></h2>
<ul>
<li class="">Paper (open access): <a href="https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1011793" target="_blank" rel="noopener noreferrer" class="">https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1011793</a></li>
<li class="">PubMed record: <a href="https://pubmed.ncbi.nlm.nih.gov/40720497/" target="_blank" rel="noopener noreferrer" class="">https://pubmed.ncbi.nlm.nih.gov/40720497/</a></li>
<li class="">Code example (linked from the paper): <a href="https://github.com/andrew-e/side-effect-score" target="_blank" rel="noopener noreferrer" class="">https://github.com/andrew-e/side-effect-score</a></li>
</ul>]]></content>
        <author>
            <name>Tom Gaunt</name>
        </author>
        <author>
            <name>with initial draft by AI</name>
        </author>
        <category label="MR" term="MR"/>
        <category label="OpenGWAS" term="OpenGWAS"/>
        <category label="drug safety" term="drug safety"/>
        <category label="pharmacogenomics" term="pharmacogenomics"/>
        <category label="mental health" term="mental health"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[M-PreSS: a transparent, open-source approach to study screening in systematic reviews]]></title>
        <id>https://dmer-group.example.com/news/m-press-preprint</id>
        <link href="https://dmer-group.example.com/news/m-press-preprint"/>
        <updated>2025-06-30T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Fine-tuning an open biomedical language model (BlueBERT) with a Siamese network to generalise title/abstract screening across review topics, reducing workload while keeping methods transparent.]]></summary>
        <content type="html"><![CDATA[<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="overview">Overview<a href="https://dmer-group.example.com/news/m-press-preprint#overview" class="hash-link" aria-label="Direct link to Overview" title="Direct link to Overview" translate="no">​</a></h2>
<p>Screening thousands of titles and abstracts is often the single biggest bottleneck in a systematic review workflow. In this new medRxiv preprint, we describe <strong>M-PreSS</strong>: a model pre-training approach that aims to make screening faster <em>without</em> relying on closed, black-box systems.</p>
<p>The key idea is to start from an <strong>open biomedical language model (BlueBERT)</strong> and fine-tune it for screening using a <strong>Siamese neural network</strong> setup, so that the resulting model can generalise across different review topics rather than needing a brand-new model each time.</p>
<p><a href="https://doi.org/10.1101/2025.04.08.25325463" target="_blank" rel="noopener noreferrer" class="">Xu et al. "M-PreSS: A Model Pre-training Approach for Study Screening in Systematic Reviews". 2025, medRxiv DOI:10.1101/2025.04.08.25325463</a></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-we-did">What we did<a href="https://dmer-group.example.com/news/m-press-preprint#what-we-did" class="hash-link" aria-label="Direct link to What we did" title="Direct link to What we did" translate="no">​</a></h2>
<p>In M-PreSS, we fine-tuned BlueBERT to produce representations of study records (titles/abstracts) that can be used to score relevance for screening decisions. We then evaluated several training strategies in <strong>seven COVID-19 systematic reviews</strong>, focusing on whether a model trained on some topics could transfer to another topic.</p>
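<p>One way to picture how such representations could drive screening is to rank records by their similarity to a representation of the review topic. The sketch below does this with cosine similarity on hand-written toy vectors. The vectors, the record identifiers and the choice of cosine similarity are all assumptions for illustration; the preprint describes the actual scoring procedure.</p>
<pre>
```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_records(topic_vec, record_vecs):
    """Sort record ids from most to least similar to the topic vector."""
    scored = [(rec_id, cosine(topic_vec, vec)) for rec_id, vec in record_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical 2-d embeddings; a real model would produce much longer vectors.
topic = [1.0, 0.0]
record_vecs = {
    "rec-1": [0.9, 0.1],
    "rec-2": [0.0, 1.0],
    "rec-3": [0.5, 0.5],
}

ranking = rank_records(topic, record_vecs)
for rec_id, score in ranking:
    print(rec_id, round(score, 3))
```
</pre>
<p>In this toy setup a reviewer would read from the top of the ranking, and a threshold on the score would turn it into include/exclude decisions.</p>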
<p>Two practical variations explored in the preprint are:</p>
<ul>
<li class=""><strong>Enriching the “topic definition”</strong> used for training by adding explicit study selection criteria (the kind you would normally write in a protocol).</li>
<li class=""><strong>Training on more related review topics</strong>, to encourage broader generalisation.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-we-found">What we found<a href="https://dmer-group.example.com/news/m-press-preprint#what-we-found" class="hash-link" aria-label="Direct link to What we found" title="Direct link to What we found" translate="no">​</a></h2>
<p>Across the seven COVID-19 reviews, the approach showed <strong>good cross-topic performance</strong>:</p>
<ul>
<li class="">Average <strong>recall/sensitivity</strong> was reported as <strong>0.86</strong> (range <strong>0.67–1.00</strong>).</li>
<li class="">Average <strong>false positive rate</strong> was <strong>6.48%</strong> (range <strong>1.38%–11.41%</strong>).</li>
</ul>
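<p>For readers less familiar with these metrics, the sketch below shows how recall and false positive rate are computed from screening decisions. The labels and predictions are made-up toy values, not data from the preprint, and the function name is our own.</p>
<pre>
```python
def screening_metrics(labels, predictions):
    """Return (recall, false_positive_rate) for binary include/exclude decisions.

    labels and predictions are sequences of 1 (include) and 0 (exclude).
    """
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # relevant, kept
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # relevant, missed
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # irrelevant, kept
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # irrelevant, dropped
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one relevant and one irrelevant record")
    recall = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)
    return recall, false_positive_rate


# Toy example: 3 relevant and 5 irrelevant records.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 1, 0, 0, 0, 0]
recall, fpr = screening_metrics(labels, predictions)
print(round(recall, 3), round(fpr, 3))
```
</pre>
<p>In screening, recall measures how many truly relevant studies survive the filter, while the false positive rate measures how much irrelevant material reviewers still have to read.</p>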
<p>Two additional findings are especially relevant if you are thinking about deploying screening models in real review pipelines:</p>
<ul>
<li class="">Adding <strong>study selection criteria</strong> into the topic definition improved <strong>precision–recall performance (PRAUC)</strong> by <strong>2.74%</strong>.</li>
<li class="">Adding <strong>more related topics</strong> during training increased performance by <strong>15.82%</strong>.</li>
</ul>
<p>We also report that, in the COVID-19 topics we compared against, this fine-tuned open model can <strong>outperform ChatGPT/GPT-4 in two out of three</strong> previously reported screening settings, while using substantially fewer computational resources.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-matters">Why this matters<a href="https://dmer-group.example.com/news/m-press-preprint#why-this-matters" class="hash-link" aria-label="Direct link to Why this matters" title="Direct link to Why this matters" translate="no">​</a></h2>
<p>From our perspective, this work lands in a useful “sweet spot”:</p>
<ul>
<li class=""><strong>Transparent and reproducible</strong>: the underlying model is open, and the training approach can be documented and rerun.</li>
<li class=""><strong>Generalises across topics</strong>: one model can be reused across reviews, rather than building a bespoke model from scratch each time.</li>
<li class=""><strong>Practical levers to improve performance</strong>: especially the finding that writing selection criteria in a structured way can directly help the model.</li>
</ul>
<p>That combination is important if we want screening automation to be something review teams can actually trust, maintain, and update over time.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="limitations-and-next-steps">Limitations and next steps<a href="https://dmer-group.example.com/news/m-press-preprint#limitations-and-next-steps" class="hash-link" aria-label="Direct link to Limitations and next steps" title="Direct link to Limitations and next steps" translate="no">​</a></h2>
<p>A couple of things we will be considering as we continue to work in this space:</p>
<ul>
<li class=""><strong>Beyond COVID-19</strong>: the evaluation focuses on COVID-19 reviews, so it will be interesting to see how well the approach transfers to other domains (e.g. nutrition, cancer epidemiology, environmental exposures).</li>
<li class=""><strong>Human-in-the-loop integration</strong>: the biggest real-world gains often come from pairing models with active learning, prioritisation, and clear stopping rules—how M-PreSS plugs into those workflows will matter.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-a-note-on-preprints">🧪 A Note on Preprints<a href="https://dmer-group.example.com/news/m-press-preprint#-a-note-on-preprints" class="hash-link" aria-label="Direct link to 🧪 A Note on Preprints" title="Direct link to 🧪 A Note on Preprints" translate="no">​</a></h2>
<p>This work is currently a <strong>preprint</strong>, meaning it has not yet been peer-reviewed. Preprints should be interpreted as early reports of research findings, and while valuable for rapid dissemination, they are preliminary.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="links">Links<a href="https://dmer-group.example.com/news/m-press-preprint#links" class="hash-link" aria-label="Direct link to Links" title="Direct link to Links" translate="no">​</a></h2>
<ul>
<li class="">Preprint (v2): <a href="https://www.medrxiv.org/content/10.1101/2025.04.08.25325463v2" target="_blank" rel="noopener noreferrer" class="">https://www.medrxiv.org/content/10.1101/2025.04.08.25325463v2</a></li>
<li class="">Project code repository: <a href="https://github.com/automation-in-systematic-reviews/M-PreSS" target="_blank" rel="noopener noreferrer" class="">https://github.com/automation-in-systematic-reviews/M-PreSS</a></li>
</ul>]]></content>
        <author>
            <name>Tom Gaunt</name>
        </author>
        <author>
            <name>with initial draft by AI</name>
        </author>
        <category label="systematic reviews" term="systematic reviews"/>
        <category label="automation" term="automation"/>
        <category label="machine learning" term="machine learning"/>
        <category label="NLP" term="NLP"/>
        <category label="preprint" term="preprint"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Integrating Mendelian randomization and literature mining to map breast cancer risk factors]]></title>
        <id>https://dmer-group.example.com/news/mr-literature-breast-cancer-risk-factors</id>
        <link href="https://dmer-group.example.com/news/mr-literature-breast-cancer-risk-factors"/>
        <updated>2025-05-31T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[We combine MR-EvE (everything-vs-everything Mendelian randomization) with literature-mined evidence in EpiGraphDB to generate and test hypotheses about breast cancer risk factors and their potential intermediates.]]></summary>
        <content type="html"><![CDATA[<p><img decoding="async" loading="lazy" alt="Illustration of integrating MR and literature-mined evidence to identify breast cancer risk pathways." src="https://dmer-group.example.com/assets/images/mr_lbd_breast_cancer-09cb5d8c424fc7ee47d4f421c470674a.png" width="2306" height="2452" class="img_ev3q"></p>
<p>Breast cancer research spans epidemiology, molecular biology, clinical trials, and a vast and rapidly growing literature. One challenge is <em>triangulating</em> across these evidence types: when different sources point in the same direction, we can be more confident we are seeing something causal rather than correlational.</p>
<p>In a paper led by Marina Vabistsevits and published in the <em><a href="https://doi.org/10.1016/j.jbi.2025.104810" target="_blank" rel="noopener noreferrer" class="">Journal of Biomedical Informatics</a></em>, we show how to bring two complementary sources together:</p>
<ol>
<li class=""><strong>Mendelian randomization (MR)</strong> evidence generated at scale using <strong>MR-EvE (“Everything-vs-Everything”)</strong>, and</li>
<li class=""><strong>Literature-mined relationships</strong> stored in <strong><a href="https://www.epigraphdb.org/" target="_blank" rel="noopener noreferrer" class="">EpiGraphDB</a></strong>, our biomedical knowledge graph.</li>
</ol>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-combine-mr-with-literature-mining">Why combine MR with literature mining?<a href="https://dmer-group.example.com/news/mr-literature-breast-cancer-risk-factors#why-combine-mr-with-literature-mining" class="hash-link" aria-label="Direct link to Why combine MR with literature mining?" title="Direct link to Why combine MR with literature mining?" translate="no">​</a></h2>
<p>MR can help prioritise likely causal risk factors, but it does not automatically tell us <strong>how</strong> an exposure influences disease. Meanwhile, the biomedical literature is full of mechanistic clues—but it is too large to read manually, and individual papers can be hard to weigh.</p>
<p>Our aim was to use <strong>MR for efficient hypothesis generation</strong>, and then use <strong>literature-mined links to suggest plausible intermediates/mediators</strong>, before returning to genetics for validation.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-we-did">What we did<a href="https://dmer-group.example.com/news/mr-literature-breast-cancer-risk-factors#what-we-did" class="hash-link" aria-label="Direct link to What we did" title="Direct link to What we did" translate="no">​</a></h2>
<p>We started with MR-EvE estimates to screen many traits against breast cancer outcomes, looking for candidate risk factors and possible mediators. We then integrated these MR results with <strong>literature-mined “triples”</strong> (subject–predicate–object statements extracted from papers) in EpiGraphDB, using an approach based on <strong>overlapping “literature spaces”</strong> between a risk factor trait and breast cancer.</p>
<p>Finally, for literature-based discovery (LBD) candidates, we used <strong>two-step MR</strong> to check whether a proposed intermediate sat on a plausible causal path from risk factor → intermediate → breast cancer.</p>
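<p>A common way to summarise a two-step MR analysis is the product-of-coefficients method: the effect of the risk factor on the intermediate is multiplied by the effect of the intermediate on the outcome, with a first-order delta-method standard error. The sketch below illustrates that general technique under the assumption that the two estimates are independent. The function name and input numbers are hypothetical, and this is not necessarily the exact calculation used in the paper.</p>
<pre>
```python
import math


def indirect_effect(beta_xm, se_xm, beta_my, se_my):
    """Product-of-coefficients estimate of the indirect (mediated) effect.

    beta_xm, se_xm: risk factor -> intermediate estimate and standard error
    beta_my, se_my: intermediate -> outcome estimate and standard error
    The standard error uses the first-order delta method and assumes the
    two estimates are independent.
    """
    estimate = beta_xm * beta_my
    se = math.sqrt(beta_xm ** 2 * se_my ** 2 + beta_my ** 2 * se_xm ** 2)
    return estimate, se


# Hypothetical inputs chosen only to show the arithmetic.
estimate, se = indirect_effect(beta_xm=0.5, se_xm=0.1, beta_my=0.2, se_my=0.05)
lower = estimate - 1.96 * se
upper = estimate + 1.96 * se
print(round(estimate, 4), round(se, 4), round(lower, 4), round(upper, 4))
```
</pre>
<p>An interval that excludes zero would be consistent with the intermediate lying on the causal path, although pleiotropy and weak instruments can still bias either step.</p>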
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-we-found">What we found<a href="https://dmer-group.example.com/news/mr-literature-breast-cancer-risk-factors#what-we-found" class="hash-link" aria-label="Direct link to What we found" title="Direct link to What we found" translate="no">​</a></h2>
<p>Using this pipeline, we identified <strong>129 lifestyle risk factors and molecular traits</strong> with evidence of an effect on breast cancer (including both established and potentially novel signals). We also made the MR results explorable via an <strong>R/Shiny app</strong> for interactive browsing and hypothesis generation.</p>
<p>To show how the integration works in practice, the paper walks through two case studies:</p>
<ul>
<li class=""><strong>Childhood body size</strong>, where combining MR and literature helps explore downstream intermediates that might connect early-life adiposity to later breast cancer risk.</li>
<li class=""><strong>HDL-cholesterol</strong>, where the literature-mined links provide mechanistic hypotheses that can then be followed up using genetics-based mediation checks.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-matters">Why this matters<a href="https://dmer-group.example.com/news/mr-literature-breast-cancer-risk-factors#why-this-matters" class="hash-link" aria-label="Direct link to Why this matters" title="Direct link to Why this matters" translate="no">​</a></h2>
<p>This is not about replacing careful study design or detailed mechanistic work. The point is to make it <strong>easier to navigate the space of plausible hypotheses</strong>, and to prioritise follow-up work with a clearer view of (a) what looks causal and (b) what the literature suggests about potential pathways.</p>
<p>More broadly, it’s a demonstration of what we think knowledge graphs are good at: <strong>connecting evidence across study types</strong> and helping us ask better questions, faster.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="try-it-yourself">Try it yourself<a href="https://dmer-group.example.com/news/mr-literature-breast-cancer-risk-factors#try-it-yourself" class="hash-link" aria-label="Direct link to Try it yourself" title="Direct link to Try it yourself" translate="no">​</a></h2>
<ul>
<li class="">Paper: <em><a href="https://doi.org/10.1016/j.jbi.2025.104810" target="_blank" rel="noopener noreferrer" class="">Integrating Mendelian randomization and literature-mined evidence for breast cancer risk factors</a></em> (Journal of Biomedical Informatics, 2025).</li>
<li class="">Interactive MR heatmaps app: <a href="https://mvab.shinyapps.io/MR_heatmaps/" target="_blank" rel="noopener noreferrer" class="">https://mvab.shinyapps.io/MR_heatmaps/</a></li>
<li class="">EpiGraphDB platform: <a href="https://epigraphdb.org/" target="_blank" rel="noopener noreferrer" class="">https://epigraphdb.org</a></li>
</ul>
<hr>
<p><em>If you use the app or EpiGraphDB in your work and have ideas for additional features, do get in touch — we’re always keen to hear how people are using these resources.</em></p>]]></content>
        <author>
            <name>Tom Gaunt</name>
        </author>
        <author>
            <name>with initial draft by AI</name>
        </author>
        <category label="MR" term="MR"/>
        <category label="epigraphdb" term="epigraphdb"/>
        <category label="breast cancer" term="breast cancer"/>
        <category label="knowledge graph" term="knowledge graph"/>
        <category label="literature mining" term="literature mining"/>
        <category label="papers" term="papers"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Dissecting blood pressure and BMI: a pathway- and tissue-partitioned Mendelian randomization comparison]]></title>
        <id>https://dmer-group.example.com/news/pathway_partitioned_mr</id>
        <link href="https://dmer-group.example.com/news/pathway_partitioned_mr"/>
        <updated>2025-05-30T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[A Genome Medicine study introduces 'pathway-partitioned' instruments informed by Mendelian disease phenotypes, and compares them to colocalization-based tissue partitioning to reveal distinct drivers of cardiovascular risk.]]></summary>
        <content type="html"><![CDATA[<p><img decoding="async" loading="lazy" alt="Pathway- vs tissue-partitioned MR, simplified schematic." src="https://dmer-group.example.com/assets/images/pathway_partitioned_mr-586bc88b46f672a1d27ec2494a8ed027.png" width="694" height="929" class="img_ev3q"></p>
<p>Complex traits like blood pressure (BP) and body mass index (BMI) are <strong>highly polygenic</strong>: hundreds of associated variants can be used as instruments in Mendelian randomization (MR). But those variants don’t all “mean the same thing” biologically—some may act through kidney physiology, others through vasculature, neurobiology, metabolism, and so on. If we can <em>separate</em> instruments into interpretable biological subsets, we can start asking questions like:</p>
<ul>
<li class=""><em>Which component of BP is most responsible for coronary heart disease risk?</em></li>
<li class=""><em>Are BMI → atrial fibrillation effects more “metabolic” or more “neuro-behavioural”?</em></li>
</ul>
<p>Work led by Genevieve Leyden and Maria Sobczyk, now published in <em><a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC12079859/" target="_blank" rel="noopener noreferrer" class="">Genome Medicine</a></em>, sets out to do exactly this by comparing two ways of <strong>partitioning genetic instruments</strong> before running MR.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="two-complementary-ways-to-split-instruments">Two complementary ways to split instruments<a href="https://dmer-group.example.com/news/pathway_partitioned_mr#two-complementary-ways-to-split-instruments" class="hash-link" aria-label="Direct link to Two complementary ways to split instruments" title="Direct link to Two complementary ways to split instruments" translate="no">​</a></h2>
<p>The paper evaluates two strategies side-by-side:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-pathway-partitioned-instruments-from-mendelian-disease-phenotypes">1) Pathway-partitioned instruments from Mendelian disease phenotypes<a href="https://dmer-group.example.com/news/pathway_partitioned_mr#1-pathway-partitioned-instruments-from-mendelian-disease-phenotypes" class="hash-link" aria-label="Direct link to 1) Pathway-partitioned instruments from Mendelian disease phenotypes" title="Direct link to 1) Pathway-partitioned instruments from Mendelian disease phenotypes" translate="no">​</a></h3>
<p>We introduce a new approach that groups genome-wide significant BP/BMI variants by whether they lie near <strong>Mendelian disease genes</strong> enriched for particular symptom categories (via MendelVar).</p>
<ul>
<li class="">For <strong>blood pressure</strong>, variants are partitioned into <em>renal</em> vs <em>vessel</em> (vasculature) pathways.</li>
<li class="">For <strong>BMI</strong>, variants are partitioned into <em>metabolic</em> vs <em>mental health</em> pathways.</li>
</ul>
<p>The idea is pragmatic: Mendelian disorders provide a clinically grounded “phenotype dictionary” that can help annotate the biology sitting underneath GWAS loci.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-tissue-partitioned-instruments-from-genetic-colocalization">2) Tissue-partitioned instruments from genetic colocalization<a href="https://dmer-group.example.com/news/pathway_partitioned_mr#2-tissue-partitioned-instruments-from-genetic-colocalization" class="hash-link" aria-label="Direct link to 2) Tissue-partitioned instruments from genetic colocalization" title="Direct link to 2) Tissue-partitioned instruments from genetic colocalization" translate="no">​</a></h3>
<p>We compare the above to an existing, colocalization-derived approach previously published by Genevieve that partitions variants based on evidence that a GWAS signal overlaps an <strong>eQTL</strong> signal in specific tissues.</p>
<ul>
<li class="">For <strong>blood pressure</strong>, variants are assigned to <em>kidney</em> vs <em>artery</em> tissues.</li>
<li class="">For <strong>BMI</strong>, variants are assigned to <em>brain</em> vs <em>adipose</em> tissues.</li>
</ul>
<p>These two partitioning schemes are not expected to be identical—one is anchored in <em>clinical phenotype enrichment</em>, the other in <em>molecular regulatory context</em>—but comparing them can highlight where biology is robust versus where interpretation needs caution.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-did-we-find">What did we find?<a href="https://dmer-group.example.com/news/pathway_partitioned_mr#what-did-we-find" class="hash-link" aria-label="Direct link to What did we find?" title="Direct link to What did we find?" translate="no">​</a></h2>
<p>Our headline result is that <strong>different partitions can suggest different “drivers” of the same exposure–outcome relationship</strong>.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="blood-pressure-vessel-pathways-stand-out-for-heart-disease-but-tissue-signals-are-more-nuanced">Blood pressure: vessel pathways stand out for heart disease, but tissue signals are more nuanced<a href="https://dmer-group.example.com/news/pathway_partitioned_mr#blood-pressure-vessel-pathways-stand-out-for-heart-disease-but-tissue-signals-are-more-nuanced" class="hash-link" aria-label="Direct link to Blood pressure: vessel pathways stand out for heart disease, but tissue signals are more nuanced" title="Direct link to Blood pressure: vessel pathways stand out for heart disease, but tissue signals are more nuanced" translate="no">​</a></h3>
<p>Using the <em>pathway-partitioned</em> approach, the systolic BP → heart disease effect appeared <strong>more strongly driven by vessel (vasculature) instruments than renal instruments</strong>.</p>
<p>However, in the <em>tissue-partitioned</em> analyses, the corresponding comparison suggested a <strong>stronger effect attributed to kidney than artery tissue</strong>, consistent with BP acting through multiple intertwined mechanisms rather than a single clean partition.</p>
<p>Across outcomes, we also report consistent evidence that:</p>
<ul>
<li class=""><strong>Vessel (pathway)</strong> and <strong>artery (tissue)</strong> instruments dominate the <em>negative</em> directional effect of diastolic BP on left ventricular stroke volume, and the <em>positive</em> directional effect of systolic BP on type 2 diabetes.</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="bmi-and-atrial-fibrillation-metabolic-vs-brain-components">BMI and atrial fibrillation: metabolic vs brain components<a href="https://dmer-group.example.com/news/pathway_partitioned_mr#bmi-and-atrial-fibrillation-metabolic-vs-brain-components" class="hash-link" aria-label="Direct link to BMI and atrial fibrillation: metabolic vs brain components" title="Direct link to BMI and atrial fibrillation: metabolic vs brain components" translate="no">​</a></h3>
<p>When focusing on BMI → atrial fibrillation, we found the causal effect was <strong>predominantly driven by</strong>:</p>
<ul>
<li class=""><strong>metabolic-pathway</strong> instruments (relative to mental health pathway instruments), and</li>
<li class=""><strong>brain-tissue</strong> instruments (relative to adipose tissue instruments).</li>
</ul>
<p>That contrast is interesting in itself: it suggests that “metabolic” and “brain” are not mutually exclusive stories, and may instead reflect overlapping causal routes (e.g. appetite regulation and energy balance).</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-matters-for-mr-interpretation">Why this matters for MR interpretation<a href="https://dmer-group.example.com/news/pathway_partitioned_mr#why-this-matters-for-mr-interpretation" class="hash-link" aria-label="Direct link to Why this matters for MR interpretation" title="Direct link to Why this matters for MR interpretation" translate="no">​</a></h2>
<p>A common workflow in MR is to build a single instrument set and estimate an “average” causal effect. This paper is a reminder that, for many complex traits, that average may be combining multiple biological components.</p>
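<p>To make the “average” concrete, here is a minimal sketch of a fixed-effect inverse-variance weighted (IVW) estimate computed once on a full instrument set and once per partition. IVW is used because it is a standard MR estimator; the post does not state which estimators the paper relied on. The partition labels and summary statistics are hypothetical.</p>

```python
def ivw_estimate(instruments):
    """Fixed-effect inverse-variance weighted (IVW) MR estimate.

    Each instrument is a tuple (beta_exposure, beta_outcome, se_outcome).
    Returns (estimate, standard_error).
    """
    numerator = sum(bx * by / se ** 2 for bx, by, se in instruments)
    denominator = sum(bx ** 2 / se ** 2 for bx, by, se in instruments)
    return numerator / denominator, (1 / denominator) ** 0.5


# Hypothetical summary statistics: (partition, beta_exposure, beta_outcome, se_outcome).
variants = [
    ("partition_a", 0.10, 0.050, 0.010),
    ("partition_a", 0.20, 0.100, 0.020),
    ("partition_b", 0.10, 0.010, 0.010),
    ("partition_b", 0.20, 0.020, 0.020),
]

overall = ivw_estimate([v[1:] for v in variants])
by_partition = {
    label: ivw_estimate([v[1:] for v in variants if v[0] == label])
    for label in sorted({v[0] for v in variants})
}

print("all instruments:", round(overall[0], 3))
for label, (estimate, se) in by_partition.items():
    print(label, round(estimate, 3), round(se, 3))
```

<p>In this toy input the pooled estimate sits between the two partition-specific estimates, which is the situation the paragraph above describes: one number standing in for two different components.</p>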
<p>Partitioning instruments can help us:</p>
<ul>
<li class=""><strong>identify potential pleiotropic pathways</strong> (e.g. if only one partition shows an effect),</li>
<li class=""><strong>generate mechanism-focused hypotheses</strong> (which component of BP/BMI is most relevant?),</li>
<li class=""><strong>prioritize follow-up</strong> (e.g. deeper locus-to-gene work in the partition driving an effect), and</li>
<li class=""><strong>spot interpretability pitfalls</strong> when different annotation schemes tell different stories.</li>
</ul>
<p>However, we emphasize that <strong>partitioned effect-size differences need robust validation</strong> (without strong supporting evidence, an apparent difference may simply reflect chance), so this should be viewed as a <em>hypothesis-generating</em> framework rather than a final answer.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="our-take">Our take<a href="https://dmer-group.example.com/news/pathway_partitioned_mr#our-take" class="hash-link" aria-label="Direct link to Our take" title="Direct link to Our take" translate="no">​</a></h2>
<p>From our perspective, this is a neat example of how we can bring in <em>external biological knowledge</em>—here, Mendelian disease phenotype enrichment—to get more out of standard GWAS-derived instruments, and to complement more “molecular” approaches like colocalization.</p>
<p>If you are doing MR on a complex exposure and you suspect heterogeneous mechanisms, it’s worth thinking about <strong>instrument partitioning</strong> early: it can make your results easier to interpret, and it can help decide what’s worth chasing downstream.</p>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="paper">Paper<a href="https://dmer-group.example.com/news/pathway_partitioned_mr#paper" class="hash-link" aria-label="Direct link to Paper" title="Direct link to Paper" translate="no">​</a></h3>
<p>Leyden GM, Sobczyk MK, Richardson TG, Gaunt TR. <em>Distinct pathway-based effects of blood pressure and body mass index on cardiovascular traits: comparison of novel Mendelian randomization approaches.</em> <strong>Genome Medicine</strong> (2025) 17:54. DOI: 10.1186/s13073-025-01472-2. PubMed: <a href="https://pubmed.ncbi.nlm.nih.gov/40375348/" target="_blank" rel="noopener noreferrer" class="">https://pubmed.ncbi.nlm.nih.gov/40375348/</a>. Full text: <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC12079859/" target="_blank" rel="noopener noreferrer" class="">https://pmc.ncbi.nlm.nih.gov/articles/PMC12079859/</a></p>]]></content>
        <author>
            <name>Tom Gaunt</name>
        </author>
        <author>
            <name>with initial draft by AI</name>
        </author>
        <category label="MR" term="MR"/>
        <category label="blood pressure" term="blood pressure"/>
        <category label="BMI" term="BMI"/>
        <category label="cardiometabolic" term="cardiometabolic"/>
        <category label="MendelVar" term="MendelVar"/>
        <category label="colocalization" term="colocalization"/>
        <category label="genetic epidemiology" term="genetic epidemiology"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[CanDrivR-CS: cancer-specific machine learning to separate recurrent from rare missense variants]]></title>
        <id>https://dmer-group.example.com/news/candrivr_cs</id>
        <link href="https://dmer-group.example.com/news/candrivr_cs"/>
        <updated>2024-09-30T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[A new gradient-boosting framework (CanDrivR-CS) builds cancer-type-specific models to distinguish recurrent from rare somatic missense variants, highlighting a role for DNA-shape features.]]></summary>
        <content type="html"><![CDATA[<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="overview">Overview<a href="https://dmer-group.example.com/news/candrivr_cs#overview" class="hash-link" aria-label="Direct link to Overview" title="Direct link to Overview" translate="no">​</a></h2>
<p>Cancer genomes contain huge numbers of mutations, but only a subset are functionally important. One simple clue is <strong>recurrence</strong>: if the same missense variant shows up repeatedly across patients with the same cancer type, that can suggest positive selection for growth advantage. At the same time, <strong>rare</strong> variants can still matter (for example, if they emerge under treatment as resistance mechanisms).</p>
<p>In work led by Amy Francis, we introduce <strong><a href="https://www.biorxiv.org/content/10.1101/2024.09.19.613896v1" target="_blank" rel="noopener noreferrer" class="">CanDrivR-CS</a></strong>, a framework that trains <strong>cancer-type-specific</strong> machine-learning models to distinguish <em>recurrent</em> from <em>rare</em> somatic missense variants. It’s a useful reminder that “one-size-fits-all” predictors can miss disease-context signals, and that relatively interpretable models can still surface mechanistic hypotheses.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-we-did">What we did<a href="https://dmer-group.example.com/news/candrivr_cs#what-we-did" class="hash-link" aria-label="Direct link to What we did" title="Direct link to What we did" translate="no">​</a></h2>
<p>We curated missense variant data from the <strong>International Cancer Genome Consortium (ICGC)</strong> and trained a suite of <strong>gradient boosting</strong> classifiers, one per cancer type, alongside a baseline <strong>pan-cancer</strong> model. The goal was not to label variants as “pathogenic” in the clinical sense, but to learn patterns that separate variants from two cancer-relevant frequency regimes: those that recur across samples versus those that appear rarely.</p>
<p>A practical detail was our evaluation setup: we report <strong>leave-one-group-out cross-validation (LOGO-CV)</strong>, which is designed to test generalisation when a meaningful group (e.g. a gene or cohort) is held out at training time.</p>
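<p>For readers unfamiliar with LOGO-CV, the sketch below generates the splits in plain Python: each distinct group is held out once as the test set while every other group forms the training set. The group labels are hypothetical, and the grouping variable used in the paper may differ. scikit-learn offers the same splitting through <code>LeaveOneGroupOut</code>.</p>

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each distinct group."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


# Hypothetical group labels, one per variant (for example, the gene each variant falls in).
groups = ["gene_a", "gene_a", "gene_b", "gene_c", "gene_c", "gene_c"]

splits = list(leave_one_group_out(groups))
for held_out, train, test in splits:
    print(f"hold out {held_out}: train on {train}, test on {test}")
```

<p>Because no group ever appears on both sides of a split, a model cannot score well simply by memorising group-specific patterns.</p>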
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-results">Key results<a href="https://dmer-group.example.com/news/candrivr_cs#key-results" class="hash-link" aria-label="Direct link to Key results" title="Direct link to Key results" translate="no">​</a></h2>
<ul>
<li class=""><strong>Cancer-type-specific models outperformed the pan-cancer baseline</strong>, with LOGO-CV F1 scores reaching <strong>0.90</strong> for skin cutaneous melanoma (CanDrivR-SKCM) and <strong>0.89</strong> for skin adenocarcinoma (CanDrivR-SKCA), versus <strong>0.792</strong> for the baseline model.</li>
<li class=""><strong>DNA-shape properties</strong> consistently ranked among the most informative features across cancer types. We report that recurrent missense variants were enriched in regions associated with <strong>DNA bends and rolls</strong>, raising the possibility that local structural context contributes to mutational hotspots (for example via replication or repair dynamics).</li>
</ul>
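<p>As a reminder of what the F1 scores above measure, the sketch below computes F1 as the harmonic mean of precision and recall for one class. Treating “recurrent” as the positive class is an assumption made for illustration, and the labels are toy values rather than data from the paper.</p>

```python
def f1_score(true_labels, predicted_labels, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (1 = recurrent, 0 = rare); not data from the paper.
truth = [1, 1, 1, 0, 0, 1]
preds = [1, 1, 0, 0, 1, 1]
print(round(f1_score(truth, preds), 3))  # prints 0.75
```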
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-matters">Why this matters<a href="https://dmer-group.example.com/news/candrivr_cs#why-this-matters" class="hash-link" aria-label="Direct link to Why this matters" title="Direct link to Why this matters" translate="no">​</a></h2>
<p>From a translational perspective, separating “common” from “rare” somatic variants is not the whole driver/passenger story — but it is a useful lens:</p>
<ul>
<li class="">It can help prioritise variants for follow-up in <strong>cancer-type-specific</strong> settings (where selection pressures differ).</li>
<li class="">It provides an interpretable way to test whether adding new feature classes (like <strong>DNA-shape</strong>) improves discrimination.</li>
<li class="">It highlights the value of open, reusable pipelines for variant feature engineering and modelling.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="resources">Resources<a href="https://dmer-group.example.com/news/candrivr_cs#resources" class="hash-link" aria-label="Direct link to Resources" title="Direct link to Resources" translate="no">​</a></h2>
<ul>
<li class="">Preprint: <a href="https://www.biorxiv.org/content/10.1101/2024.09.19.613896v1" target="_blank" rel="noopener noreferrer" class="">https://www.biorxiv.org/content/10.1101/2024.09.19.613896v1</a></li>
<li class="">Code and data: <a href="https://github.com/amyfrancis97/CanDrivR-CS" target="_blank" rel="noopener noreferrer" class="">https://github.com/amyfrancis97/CanDrivR-CS</a></li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="paper">Paper<a href="https://dmer-group.example.com/news/candrivr_cs#paper" class="hash-link" aria-label="Direct link to Paper" title="Direct link to Paper" translate="no">​</a></h2>
<p>Francis A, Campbell C, Gaunt T. <em>CanDrivR-CS: A Cancer-Specific Machine Learning Framework for Distinguishing Recurrent and Rare Variants.</em> bioRxiv (posted Sep 23, 2024). <a href="https://www.biorxiv.org/content/10.1101/2024.09.19.613896v1" target="_blank" rel="noopener noreferrer" class="">https://www.biorxiv.org/content/10.1101/2024.09.19.613896v1</a></p>]]></content>
        <author>
            <name>Tom Gaunt</name>
        </author>
        <author>
            <name>with initial draft by AI</name>
        </author>
        <category label="cancer genomics" term="cancer genomics"/>
        <category label="machine learning" term="machine learning"/>
        <category label="variant effect prediction" term="variant effect prediction"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[DrivR-Base: a feature extraction toolkit for variant effect prediction]]></title>
        <id>https://dmer-group.example.com/news/drivr-base</id>
        <link href="https://dmer-group.example.com/news/drivr-base"/>
        <updated>2024-04-30T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[A Dockerised toolkit that pulls together conservation, regulatory, protein-structure and other annotations into ready-to-model feature tables for single nucleotide variants.]]></summary>
        <content type="html"><![CDATA[<p>Understanding which genetic variants are likely to be <em>functional</em> (and which are probably benign) is a cornerstone of modern human genetics. Over the last decade, variant-effect predictors have become increasingly sophisticated — but behind every model sits the same practical headache: <strong>assembling a sensible set of features</strong> (annotations) for millions of variants from dozens of databases.</p>
<p>In a 2024 <em><a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC11057939/" target="_blank" rel="noopener noreferrer" class="">Bioinformatics</a></em> paper led by Amy Francis, we introduce <strong>DrivR-Base</strong>, a reproducible, Dockerised toolkit that turns this feature-extraction step into something you can run and re-run with far less pain.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-problem-is-drivr-base-trying-to-solve">What problem is DrivR-Base trying to solve?<a href="https://dmer-group.example.com/news/drivr-base#what-problem-is-drivr-base-trying-to-solve" class="hash-link" aria-label="Direct link to What problem is DrivR-Base trying to solve?" title="Direct link to What problem is DrivR-Base trying to solve?" translate="no">​</a></h2>
<p>Most variant-effect prediction methods are “integrative”: they combine signals about a variant’s genomic context (e.g. conservation), regulatory annotations (e.g. ENCODE peaks), and protein-level consequences (e.g. amino-acid change, structure). The data exist — but pulling them together is often:</p>
<ul>
<li class=""><strong>time-consuming</strong> (lots of sources, formats, and edge cases),</li>
<li class=""><strong>hard to reproduce</strong> (different software versions and dependencies), and</li>
<li class=""><strong>risky</strong> (you can spend weeks extracting features that later turn out not to help your model).</li>
</ul>
<p>DrivR-Base’s core idea is simple: provide a <strong>single, consistent pipeline</strong> that extracts a broad set of annotations for <em>all possible SNVs</em> in GRCh38, so we can spend more time modelling and less time wrangling.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-is-drivr-base">What is DrivR-Base?<a href="https://dmer-group.example.com/news/drivr-base#what-is-drivr-base" class="hash-link" aria-label="Direct link to What is DrivR-Base?" title="Direct link to What is DrivR-Base?" translate="no">​</a></h2>
<p>DrivR-Base is a <strong>feature extraction toolkit</strong> for human single nucleotide variants (SNVs) in the GRCh38 genome build. It produces a table where each row is a variant and columns are feature values drawn from multiple sources (genome- and protein-level). It’s packaged for <strong>Docker</strong>, which helps make installs and runs repeatable across machines and over time.</p>
<p>The paper highlights a few motivating use-cases beyond “classic” pathogenicity prediction, including haploinsufficiency prediction and feature sets that could feed into drug repurposing workflows.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-features-does-it-extract">What features does it extract?<a href="https://dmer-group.example.com/news/drivr-base#what-features-does-it-extract" class="hash-link" aria-label="Direct link to What features does it extract?" title="Direct link to What features does it extract?" translate="no">​</a></h2>
<p>DrivR-Base groups its outputs into <strong>ten feature groups</strong>, spanning sequence context, regulatory genomics, and protein structure.</p>
<ol>
<li class="">
<p><strong>Conservation and mappability</strong><br>
<!-- -->PhyloP/PhastCons conservation scores across multiple alignments, plus Umap/Bismap mappability (useful for flagging regions prone to sequencing ambiguity).</p>
</li>
<li class="">
<p><strong>Variant Effect Predictor (VEP) annotations</strong><br>
<!-- -->Transcript consequences (one-hot encoded), predicted amino acids (wild-type vs mutant), and distances to transcripts when multiple are affected.</p>
</li>
<li class="">
<p><strong>Dinucleotide properties (DiProDB)</strong><br>
<!-- -->Thermodynamic and conformational properties for dinucleotide contexts around the variant, captured under wild-type and mutant configurations.</p>
</li>
<li class="">
<p><strong>DNA shape (DNAShapeR)</strong><br>
<!-- -->Local structural properties like minor groove width, helix twist, propeller twist, roll, and electrostatic potential in a configurable window around the SNV.</p>
</li>
<li class="">
<p><strong>GC content and CpG metrics</strong><br>
<!-- -->GC fraction, CpG counts, and observed/expected CpG across multiple window sizes.</p>
</li>
<li class="">
<p><strong>Kernel-based sequence similarity (spectrum kernels)</strong><br>
<!-- -->K-mer based comparisons between wild-type and mutant sequence windows as a compact way to encode “sequence disruption”.</p>
</li>
<li class="">
<p><strong>Amino-acid substitution matrices</strong><br>
<!-- -->Substitution rates from common matrices (e.g. BLOSUM, PAM, JTT variants) for non-synonymous variants.</p>
</li>
<li class="">
<p><strong>Amino-acid properties</strong><br>
<!-- -->Hundreds of amino-acid descriptors (e.g. hydrophobicity, polarity, flexibility) for wild-type and mutant residues.</p>
</li>
<li class="">
<p><strong>ENCODE-derived regulatory features</strong><br>
<!-- -->Peaks and signal summaries across multiple assay types (TF ChIP-seq, histone marks, DNase/ATAC, eCLIP, etc.). Note: the authors report this step can require substantial local storage (roughly 160 GB) because it downloads large ENCODE datasets.</p>
</li>
<li class="">
<p><strong>Protein structure features from AlphaFold (and PDB)</strong><br>
<!-- -->For coding variants, mapping to protein positions enables extraction of AlphaFold structural information (e.g. atom coordinates and conformation-type encodings).</p>
</li>
</ol>
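<p>Some of the sequence-level groups above are simple enough to sketch directly. The example below computes GC fraction, the observed/expected CpG ratio, and a spectrum kernel (the dot product of k-mer count vectors) for a wild-type and a mutant window. It illustrates the general definitions only: it is not DrivR-Base’s code, and the window length, the choice of <code>k</code>, and the sequences are assumptions.</p>

```python
from collections import Counter


def gc_fraction(seq):
    """Fraction of bases in the window that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_observed_expected(seq):
    """Observed/expected CpG ratio: (CpG count * length) / (C count * G count)."""
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


def spectrum_kernel(seq_a, seq_b, k=3):
    """Spectrum kernel: dot product of the k-mer count vectors of two sequences."""
    def kmer_counts(seq):
        seq = seq.upper()
        return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

    counts_a, counts_b = kmer_counts(seq_a), kmer_counts(seq_b)
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())


# Hypothetical 11-base window with a C-to-T substitution at the centre position.
wild_type = "ATCGACGTTAC"
mutant = "ATCGATGTTAC"

for name, seq in (("wild-type", wild_type), ("mutant", mutant)):
    print(name, round(gc_fraction(seq), 3), round(cpg_observed_expected(seq), 3))
print("kernel(wild-type, wild-type):", spectrum_kernel(wild_type, wild_type))
print("kernel(wild-type, mutant):", spectrum_kernel(wild_type, mutant))
```

<p>The substitution removes one CpG and changes the three 3-mers that overlap the variant, so the kernel value drops from 9 to 6. That drop is the kind of “sequence disruption” signal a spectrum kernel is meant to capture.</p>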
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-matters-for-our-work">Why this matters for our work<a href="https://dmer-group.example.com/news/drivr-base#why-this-matters-for-our-work" class="hash-link" aria-label="Direct link to Why this matters for our work" title="Direct link to Why this matters for our work" translate="no">​</a></h2>
<p>A lot of what we do in the DMER team sits at the interface of <strong>genetic evidence and downstream biology</strong> — and variant-level annotations are often the glue. Even when our end goal isn’t “variant pathogenicity prediction”, having a robust, standardised way to pull out features can help with:</p>
<ul>
<li class="">building or benchmarking new predictors (and understanding <em>why</em> they behave as they do),</li>
<li class="">prioritising variants for experimental follow-up, and</li>
<li class="">reusing the same feature definitions across projects to avoid “feature drift”.</li>
</ul>
<p>Just as importantly, DrivR-Base makes it easier to <strong>ask the boring-but-essential questions early</strong>, like: <em>Which feature groups are actually informative for my prediction task?</em> That can save a lot of iteration time.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="getting-started">Getting started<a href="https://dmer-group.example.com/news/drivr-base#getting-started" class="hash-link" aria-label="Direct link to Getting started" title="Direct link to Getting started" translate="no">​</a></h2>
<p>DrivR-Base is distributed via GitHub with Docker instructions. The paper and repository are the best places to start:</p>
<ul>
<li class="">Paper (open access via PubMed Central): <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC11057939/" target="_blank" rel="noopener noreferrer" class="">https://pmc.ncbi.nlm.nih.gov/articles/PMC11057939/</a></li>
<li class="">Code: <a href="https://github.com/amyfrancis97/DrivR-Base" target="_blank" rel="noopener noreferrer" class="">https://github.com/amyfrancis97/DrivR-Base</a></li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="reference">Reference<a href="https://dmer-group.example.com/news/drivr-base#reference" class="hash-link" aria-label="Direct link to Reference" title="Direct link to Reference" translate="no">​</a></h2>
<p>Francis A, Campbell C, Gaunt TR. <strong>DrivR-Base: a feature extraction toolkit for variant effect prediction model construction.</strong> <em>Bioinformatics</em> (2024).</p>]]></content>
        <author>
            <name>Tom Gaunt</name>
        </author>
        <author>
            <name>with initial draft by AI</name>
        </author>
        <category label="software" term="software"/>
        <category label="genetics" term="genetics"/>
        <category label="machine learning" term="machine learning"/>
        <category label="variant effect prediction" term="variant effect prediction"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Pilot analysis on BioRxiv and MedRxiv full text data to facilitate comprehensive data mining on biomedical literature]]></title>
        <id>https://dmer-group.example.com/news/2023-08-21-ebi-rapid-research-funding-call-project</id>
        <link href="https://dmer-group.example.com/news/2023-08-21-ebi-rapid-research-funding-call-project"/>
        <updated>2023-08-21T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[We have curated the full-text data archives of BioRxiv and MedRxiv preprints and conducted exploratory analyses to inform our next-stage research projects on automated text mining of the biomedical literature.]]></summary>
        <content type="html"><![CDATA[<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="overview">Overview<a href="https://dmer-group.example.com/news/2023-08-21-ebi-rapid-research-funding-call-project#overview" class="hash-link" aria-label="Direct link to Overview" title="Direct link to Overview" translate="no">​</a></h2>
<p>The <a href="https://biorxiv.org/" target="_blank" rel="noopener noreferrer" class="">BioRxiv</a> and <a href="https://medrxiv.org/" target="_blank" rel="noopener noreferrer" class="">MedRxiv</a> preprint servers are vital infrastructure for the biomedical research community. They are also a rich and comprehensive resource for mining the biomedical literature to investigate research trends, interests, and novel findings. In previous work we have mined BioRxiv and MedRxiv extensively, both to extract structured literature knowledge into <a href="https://epigraphdb.org/" target="_blank" rel="noopener noreferrer" class="">EpiGraphDB</a>[1] and to derive research claims from recent preprints that can be triangulated with other evidence in <a href="https://asq.epigraphdb.org/" target="_blank" rel="noopener noreferrer" class="">ASQ</a>[2].</p>
<p>In this project, supported by the <a href="https://www.bristol.ac.uk/blackwell/" target="_blank" rel="noopener noreferrer" class="">Elizabeth Blackwell Institute</a> Rapid Research Funding Call, we acquired the full-text data archives of BioRxiv and MedRxiv preprints from 2013 to May 2023 and carried out exploratory analyses of them.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="full-text-archives-of-biorxiv--medrxiv-preprints">Full text archives of BioRxiv / MedRxiv preprints<a href="https://dmer-group.example.com/news/2023-08-21-ebi-rapid-research-funding-call-project#full-text-archives-of-biorxiv--medrxiv-preprints" class="hash-link" aria-label="Direct link to Full text archives of BioRxiv / MedRxiv preprints" title="Direct link to Full text archives of BioRxiv / MedRxiv preprints" translate="no">​</a></h2>
<p>Web scraping preprint text is time-consuming and error-prone. However, BioRxiv and MedRxiv provide archive data on the preprints for the purpose of <a href="https://www.biorxiv.org/tdm" target="_blank" rel="noopener noreferrer" class="">text and data mining</a>, hosted on Amazon S3 as Requester Pays buckets. Full text archives from BioRxiv and MedRxiv are stored in two S3 buckets, <code>biorxiv-src-monthly</code> and <code>medrxiv-src-monthly</code> respectively, whose organization is roughly as follows (a high-level description is also available on the <a href="https://www.biorxiv.org/tdm" target="_blank" rel="noopener noreferrer" class="">TDM pages</a>):</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">    biorxiv-src-monthly</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    ├── Back_Content</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    └── Current_Content</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        ├── April_2019</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        ├── April_2020</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        │&nbsp;&nbsp; ├── 0002415e-6e79-1014-bad3-d7b11ff8718c.meca</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        │&nbsp;&nbsp; ├── 0008729e-7222-1014-9e12-b08e9cbb4568.meca</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        │&nbsp;&nbsp; ├── 00114dfb-6f21-1014-b58f-80aa2e0e89bd.meca</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        │&nbsp;&nbsp; ├── 00114e62-6cf9-1014-aedd-beed0b185e0b.meca</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        │&nbsp;&nbsp; ├── 0025be66-6dea-1014-8d12-849fadd63f55.meca</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        │&nbsp;&nbsp; ├── ....</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        │&nbsp;&nbsp; ├── ....</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        ├── April_2021</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        ├── April_2022</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        ├── April_2023</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        ├── August_2019</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        ├── August_2020</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        ├── August_2021</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        ├── August_2022</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        ├── December_2018</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        ├── December_2019</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        ├── December_2020</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        ├── December_2021</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        ├── December_2022</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        ├── ...</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        ├── ...</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    medrxiv-src-monthly</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    ├── Back_Content</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    └── Current_Content</span><br></span></code></pre></div></div>
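<p>As a rough illustration of walking this layout, the sketch below lists the <code>.meca</code> keys under one monthly folder. It assumes a boto3-style S3 client is passed in and that the key prefix follows the <code>Current_Content/Month_Year/</code> pattern shown in the tree above; the helper names <code>month_prefix</code> and <code>iter_meca_keys</code> are hypothetical and are not taken from our repository.</p>

```python
from typing import Any, Iterator


def month_prefix(month: str, year: int) -> str:
    """Build the key prefix of one monthly folder, e.g. 'Current_Content/April_2020/'."""
    return f"Current_Content/{month}_{year}/"


def iter_meca_keys(client: Any, bucket: str, prefix: str) -> Iterator[str]:
    """Yield the keys of all .meca archives under a prefix in a Requester Pays bucket."""
    paginator = client.get_paginator("list_objects_v2")
    # RequestPayer="requester" acknowledges that the caller pays for the requests.
    pages = paginator.paginate(Bucket=bucket, Prefix=prefix, RequestPayer="requester")
    for page in pages:
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".meca"):
                yield obj["Key"]


# Usage (needs boto3 and AWS credentials; the requester is billed):
#   import boto3
#   client = boto3.client("s3")
#   for key in iter_meca_keys(client, "biorxiv-src-monthly", month_prefix("April", 2020)):
#       print(key)
```

<p>Passing the client in as an argument keeps the listing logic independent of how credentials are configured.</p>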
<p>Each <code>.meca</code> file is an archive in <code>.zip</code> format which contains the individual files associated with one preprint submission. For example, the following structure corresponds to the files associated with this <a href="https://www.medrxiv.org/content/10.1101/2022.09.07.22279527v1" target="_blank" rel="noopener noreferrer" class="">preprint (Shrestha et al., 2022, Knowledge, Attitude and Practice (KAP) study on COVID-19 among the general population of Nepal)</a>. Specifically, the file <code>./content/22279527.xml</code> is the “full text / manuscript” file in <code>.xml</code> format.</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">    0a2ef310-6c04-1014-8ee5-ac250845df11.meca</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    ├── content</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    │&nbsp;&nbsp; ├── 22279527.pdf</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    │&nbsp;&nbsp; ├── 22279527v1_tbl1.tif</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    │&nbsp;&nbsp; ├── 22279527v1_tbl2.tif</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    │&nbsp;&nbsp; ├── 22279527v1_tbl3.tif</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    │&nbsp;&nbsp; ├── 22279527v1_tbl4.tif</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    │&nbsp;&nbsp; ├── 22279527v1_tbl5a.tif</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    │&nbsp;&nbsp; ├── 22279527v1_tbl5.tif</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    │&nbsp;&nbsp; ├── 22279527v1_tbl6a.tif</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    │&nbsp;&nbsp; ├── 22279527v1_tbl6b.tif</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    │&nbsp;&nbsp; ├── 22279527v1_tbl6.tif</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    │&nbsp;&nbsp; └── 22279527.xml</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    ├── directives.xml</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    ├── manifest.xml</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    ├── mimetype</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    └── transfer.xml</span><br></span></code></pre></div></div>
<p>How do we know which <code>.meca</code> archive filename corresponds to which preprint DOI? Unfortunately, we haven’t found a way to know this without actually opening the <code>.meca</code> file and parsing the full text XML. In addition, we have identified several rounds of changes to the storage structure, as well as to the text format (e.g.&nbsp;how key information such as the submission date is represented in XML tags), which means that it takes effort to systematically and robustly curate information such as metadata and manuscript sections (e.g.&nbsp;abstracts, methods, results, conclusions). We will need to curate this information as separate intermediate datasets for our follow-up research projects.</p>
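<p>The lookup described above can be sketched as follows: open the archive as a zip file, find the manuscript XML under <code>content/</code>, and read the DOI from it. This is a minimal sketch and not our production parser. It assumes JATS-style markup in which the DOI is an <code>article-id</code> element with <code>pub-id-type="doi"</code>, and the function name <code>doi_from_meca</code> is hypothetical. Given the format changes noted above, real archives may need a more tolerant parser.</p>

```python
import zipfile
import xml.etree.ElementTree as ET
from typing import BinaryIO, Optional, Union


def doi_from_meca(meca: Union[str, BinaryIO]) -> Optional[str]:
    """Return the DOI found in the manuscript XML of a .meca archive, or None."""
    with zipfile.ZipFile(meca) as archive:
        # The manuscript XML sits under content/; manifest.xml and the other
        # bookkeeping files live at the archive root and are skipped.
        xml_names = [
            name
            for name in archive.namelist()
            if name.startswith("content/") and name.endswith(".xml")
        ]
        for name in xml_names:
            root = ET.fromstring(archive.read(name))
            # Assumption: JATS-style markup, where the DOI is an article-id
            # element carrying pub-id-type="doi".
            for elem in root.iter("article-id"):
                if elem.get("pub-id-type") == "doi" and elem.text:
                    return elem.text.strip()
    return None
```

<p>Running this once per archive and storing the filename-to-DOI pairs would give the kind of intermediate dataset mentioned above.</p>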
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="some-exploratory-results">Some exploratory results<a href="https://dmer-group.example.com/news/2023-08-21-ebi-rapid-research-funding-call-project#some-exploratory-results" class="hash-link" aria-label="Direct link to Some exploratory results" title="Direct link to Some exploratory results" translate="no">​</a></h2>
<p>As part of our ongoing efforts in parsing and curating the preprint information from the raw archives, here we discuss some of the results gathered so far, which will inform future projects.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="trends-and-distributions">Trends and distributions<a href="https://dmer-group.example.com/news/2023-08-21-ebi-rapid-research-funding-call-project#trends-and-distributions" class="hash-link" aria-label="Direct link to Trends and distributions" title="Direct link to Trends and distributions" translate="no">​</a></h4>
<p>Figures 1-3 show the trends and distributions of the preprints. MedRxiv was separated from BioRxiv in 2019 to host preprints related to clinical and medical topics, and both servers have now become vital infrastructure for the scientific community in rapidly communicating findings. The COVID-19 pandemic also contributed substantially to the growth in researchers presenting their findings as preprints before the peer-review process is complete. Bioinformatics, Cancer Biology, Cell Biology, Evolutionary Biology, Microbiology, and Neuroscience are among the most populous categories in BioRxiv, whereas Epidemiology, Infectious Diseases (except HIV/AIDS), and Public and Global Health are among the most populous ones in MedRxiv.</p>
<p><img decoding="async" loading="lazy" src="https://dmer-group.example.com/assets/images/preprints-count-total-b29998d7ab18e74de410f5154412282e.png" width="995" height="447" class="img_ev3q"> <strong>Figure 1: Total number of monthly preprint submissions</strong> We count the archives by their submission month and preprint source (biorxiv / medrxiv). The red line corresponds to the combined submissions from both sources. The black dashed line corresponds to February 2020, when COVID-19 started to become a pandemic, and the red dashed line corresponds to May 2020, around which the monthly submissions peaked.</p>
<p><img decoding="async" loading="lazy" src="https://dmer-group.example.com/assets/images/preprints-count-biorxiv-2cdd6c3d366db70e32c63ae45816b390.png" width="995" height="447" class="img_ev3q"> <strong>Figure 2: Total number of monthly BioRxiv submissions by category</strong> We count the archives by their submission month and category, and consolidate the categories outside the 15 most populous ones into an “Others” category.</p>
<p><img decoding="async" loading="lazy" src="https://dmer-group.example.com/assets/images/preprints-count-medrxiv-a9b4f6406d9e84790235d0295a593f28.png" width="1049" height="447" class="img_ev3q"> <strong>Figure 3: Total number of monthly MedRxiv submissions by category</strong> We count the archives by their submission month and category, and consolidate the categories outside the 15 most populous ones into an “Others” category.</p>
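<p>The counting behind Figures 2 and 3 can be sketched roughly as below. This is an illustrative sketch rather than the code used for the figures: the function name <code>monthly_counts</code> and the record layout are hypothetical, and it assumes the top 15 categories are ranked by their total over all months (ranking within each month would be another plausible reading).</p>

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def monthly_counts(
    records: Iterable[Tuple[str, str]], top_n: int = 15
) -> Dict[Tuple[str, str], int]:
    """Count preprints per (month, category), folding rare categories into 'Others'.

    records holds (submission_month, category) pairs,
    e.g. ("2020-05", "Epidemiology").
    """
    records = list(records)
    # Assumption: categories are ranked by their total count over all months.
    totals = Counter(category for _, category in records)
    keep = {category for category, _ in totals.most_common(top_n)}
    counts: Counter = Counter()
    for month, category in records:
        label = category if category in keep else "Others"
        counts[(month, label)] += 1
    return dict(counts)
```

<p>The resulting dictionary maps each (month, category) pair to a count, which is the shape a stacked monthly plot needs.</p>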
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="topic-analysis">Topic analysis<a href="https://dmer-group.example.com/news/2023-08-21-ebi-rapid-research-funding-call-project#topic-analysis" class="hash-link" aria-label="Direct link to Topic analysis" title="Direct link to Topic analysis" translate="no">​</a></h4>
<p>To further demonstrate the distribution of research topics, we ran a simple topic analysis, using the BERTopic library to embed and model all (over 300,000) article titles and derive topics corresponding to research themes. Figure 4 shows the trends of the dynamic topics, where we visually identify a topic related to the transmission of the SARS-CoV-2 virus (“sarscov2”, “transmission”, “seroprevalence”, etc.) dominating the others during the early phase (2020-2021) of the pandemic and then gradually declining. In addition, a topic related to vaccines (“vaccination”, “vaccine hesitancy”, etc.) can also be seen to be prevalent during 2021. Figure 5 then further demonstrates how each topic relates to the others in the topic clusters.</p>
<p><img decoding="async" loading="lazy" src="https://dmer-group.example.com/assets/images/topics_over_time-5322de7e4e90303b9a3698f52b2c2e3c.png" width="1250" height="450" class="img_ev3q"> <strong>Figure 4: Trends of topics from the dynamic topic modelling analysis</strong> Dynamic topics are modelled by associating the submission month of the preprint with the topics derived from the article titles.</p>
<p><img decoding="async" loading="lazy" src="https://dmer-group.example.com/assets/images/topics-8281535dab8bc4fc3a95039e2824b464.png" width="1800" height="1800" class="img_ev3q"> <strong>Figure 5: Topic clusters</strong> We randomly selected a sample of 100,000 documents (article titles) to generate the topic model, and then used the UMAP dimensionality reduction method to visualize the clustering of documents. Among the generated topics, we highlight those associated with the COVID-19 pandemic and, for comparison, a few others related to health data science and genetic epidemiology.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="summary">Summary<a href="https://dmer-group.example.com/news/2023-08-21-ebi-rapid-research-funding-call-project#summary" class="hash-link" aria-label="Direct link to Summary" title="Direct link to Summary" translate="no">​</a></h2>
<p>The preliminary exploratory results above demonstrate the richness of what can be mined from the BioRxiv / MedRxiv preprints. We are excited at the prospect of research projects on text mining biomedical studies involving the full historical archive of biomedical research (as proxied by the preprints), not just on public health and epidemiology topics, but also on cross-disciplinary ones such as socio-economic factors influencing the eventual publication of a preprint, or effective modelling of research trends via various network or clustering analysis methods. For potential collaborations, please reach out to Yi Liu (yi6240.liu[at]bristol.ac.uk) and Tom Gaunt (tom.gaunt[at]bristol.ac.uk).</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="code-availability">Code availability<a href="https://dmer-group.example.com/news/2023-08-21-ebi-rapid-research-funding-call-project#code-availability" class="hash-link" aria-label="Direct link to Code availability" title="Direct link to Code availability" translate="no">​</a></h2>
<p>Our code for data access and analysis has been published as a GitHub repository at <a href="https://github.com/MRCIEU/biorxiv-medrxiv-tdm" target="_blank" rel="noopener noreferrer" class="">https://github.com/MRCIEU/biorxiv-medrxiv-tdm</a>, which we will continue to update, as our current results are just a first step towards larger analysis projects.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="references">References<a href="https://dmer-group.example.com/news/2023-08-21-ebi-rapid-research-funding-call-project#references" class="hash-link" aria-label="Direct link to References" title="Direct link to References" translate="no">​</a></h2>
<p>[1] Yi Liu, Benjamin Elsworth, et al., Tom Gaunt, EpiGraphDB: a database and data mining platform for health data science, Bioinformatics, Volume 37, Issue 9, May 2021, Pages 1304–1311, <a href="https://doi.org/10.1093/bioinformatics/btaa961" target="_blank" rel="noopener noreferrer" class="">https://doi.org/10.1093/bioinformatics/btaa961</a></p>
<p>[2] Yi Liu, Tom R Gaunt, Triangulating evidence in health sciences with Annotated Semantic Queries, medRxiv 2022.04.12.22273803, <a href="https://doi.org/10.1101/2022.04.12.22273803" target="_blank" rel="noopener noreferrer" class="">https://doi.org/10.1101/2022.04.12.22273803</a></p>]]></content>
        <author>
            <name>Yi Liu</name>
        </author>
        <category label="seedcorn funding" term="seedcorn funding"/>
        <category label="text mining" term="text mining"/>
        <category label="NLP" term="NLP"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Proteome-wide Mendelian randomization in global biobank to identify multi-ancestry drug targets]]></title>
        <id>https://dmer-group.example.com/news/gbmi_pqtl</id>
        <link href="https://dmer-group.example.com/news/gbmi_pqtl"/>
        <updated>2022-11-01T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[PhD student Huiling Zhao and co-supervisor Dr Jie (Chris) Zheng published an interesting cross-ancestry MR analysis of potential drug targets in collaboration with the Global Biobank Meta-analysis Initiative.]]></summary>
        <content type="html"><![CDATA[<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="overview">Overview<a href="https://dmer-group.example.com/news/gbmi_pqtl#overview" class="hash-link" aria-label="Direct link to Overview" title="Direct link to Overview" translate="no">​</a></h2>
<p>Genetic studies have been very biased towards populations of European ancestry in western Europe and the United States of America, and this has led to a significant bias in the application of Mendelian randomization (MR) to identify intervention targets. In this project we worked with a leading international genetics consortium, the <a href="https://www.globalbiobankmeta.org/" target="_blank" rel="noopener noreferrer" class="">Global Biobank Meta-analysis Initiative (GBMI)</a> to evaluate the differences in predicted drug target effects between African and European ancestry populations.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-we-did">What we did<a href="https://dmer-group.example.com/news/gbmi_pqtl#what-we-did" class="hash-link" aria-label="Direct link to What we did" title="Direct link to What we did" translate="no">​</a></h2>
<p>In this paper (published in <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9646482/" target="_blank" rel="noopener noreferrer" class="">Cell Genomics</a>), Huiling and Chris carried out a multi-ancestry proteome-wide MR analysis using cross-population data from GBMI. We estimated the causal effects of 1545 proteins on eight diseases in African (n=32,658) and European (n=1,219,993) ancestry populations. We found 45 protein-disease pairs with MR and colocalization evidence of causality in European ancestry, and 7 protein-disease pairs with evidence of causality in African ancestry. The difference in sample size (and, consequently, statistical power) almost certainly explains the large difference in number of causal effects detected. Interestingly, only 2 protein-disease pairs showed MR evidence in both ancestries.</p>
<figure><img src="https://dmer-group.example.com/img/gbmi_pqtlMR.jpg" alt="Graphical abstract"><figcaption aria-hidden="true">Graphical abstract</figcaption></figure>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="paper">Paper<a href="https://dmer-group.example.com/news/gbmi_pqtl#paper" class="hash-link" aria-label="Direct link to Paper" title="Direct link to Paper" translate="no">​</a></h2>
<p>‘Proteome-wide Mendelian randomization in global biobank meta-analysis reveals multi-ancestry drug targets for common diseases’ by Huiling Zhao, Humaira Rasheed, Therese Haugdahl Nøst, Yoonsu Cho, Yi Liu, Laxmi Bhatta, Arjun Bhattacharya; Global Biobank Meta-analysis Initiative; Gibran Hemani, George Davey Smith, Ben Michael Brumpton, Wei Zhou, Benjamin M Neale, Tom R Gaunt and Jie Zheng in <a href="https://dx.doi.org/10.1016/j.xgen.2022.100195" target="_blank" rel="noopener noreferrer" class="">Cell Genom. 2022 Nov 9;2(11). doi: 10.1016/j.xgen.2022.100195</a>.</p>]]></content>
        <author>
            <name>Tom Gaunt</name>
        </author>
        <category label="drug-targets" term="drug-targets"/>
        <category label="mr" term="mr"/>
        <category label="colocalization" term="colocalization"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[New funding: NIHR Bristol Biomedical Research Centre]]></title>
        <id>https://dmer-group.example.com/news/nihr_brc_2022award</id>
        <link href="https://dmer-group.example.com/news/nihr_brc_2022award"/>
        <updated>2022-10-14T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[The National Institute for Health and Care Research Bristol Biomedical Research Centre (NIHR Bristol BRC) has been awarded nearly £12 million of new funding for the next five years. The DMER programme is linked to the Translational Data Science theme of the BRC, providing a mechanism for translation of our methodological research, software tools and data resources.]]></summary>
        <content type="html"><![CDATA[<blockquote>
<p>“The National Institute for Health and Care Research Bristol Biomedical Research Centre (NIHR Bristol BRC) has been awarded nearly £12 million of new funding for the next five years. The funding has been awarded to University Hospitals Bristol and Weston NHS Foundation Trust by the NIHR, with the University of Bristol a major partner.” - <a href="https://www.bristol.ac.uk/news/2022/october/brc-award.html" target="_blank" rel="noopener noreferrer" class="">Press release</a></p>
</blockquote>
<p>See the <a href="https://www.bristolbrc.nihr.ac.uk/" target="_blank" rel="noopener noreferrer" class="">NIHR Bristol Biomedical Research Centre website</a> for more details on the overall BRC research portfolio.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="links-to-the-data-mining-epidemiological-relationships-programme">Links to the Data Mining Epidemiological Relationships programme<a href="https://dmer-group.example.com/news/nihr_brc_2022award#links-to-the-data-mining-epidemiological-relationships-programme" class="hash-link" aria-label="Direct link to Links to the Data Mining Epidemiological Relationships programme" title="Direct link to Links to the Data Mining Epidemiological Relationships programme" translate="no">​</a></h4>
<p>The NIHR BRC <a href="https://www.bristolbrc.nihr.ac.uk/research/translational-data-science/" target="_blank" rel="noopener noreferrer" class="">Translational Data Science theme</a> is co-led by Profs Tom Gaunt and John Macleod. The theme aims to translate research in the MRC IEU, in particular in the following two workstreams:</p>
<h6 class="anchor anchorTargetStickyNavbar_Vzrq" id="prioritizing-interventions">Prioritizing interventions<a href="https://dmer-group.example.com/news/nihr_brc_2022award#prioritizing-interventions" class="hash-link" aria-label="Direct link to Prioritizing interventions" title="Direct link to Prioritizing interventions" translate="no">​</a></h6>
<p><a href="https://www.bristolbrc.nihr.ac.uk/research/translational-data-science/genetic-evidence-to-prioritise-intervention/" target="_blank" rel="noopener noreferrer" class="">The first workstream will use genetic evidence to prioritise interventions</a>, building on some of our work in the use of molecular QTL Mendelian randomization and genetic colocalization, and on our collaborations with pharmaceutical partners. This workstream will be co-led by Lavinia Paternoster and Gibran Hemani, with Tom Gaunt, Kate Tilling and George Davey Smith.</p>
<p>Mendelian randomization (MR) is a ground-breaking gene-based approach pioneered in Bristol by our Medical Director George Davey Smith. This approach doesn’t involve giving people a particular treatment. Instead, it uses natural variation in our genes as a proxy for a modifiable factor to estimate the effect of that factor on disease outcomes. It also allows us to explore how different populations are affected using existing datasets from around the world.</p>
<p>MR is now routinely used to decide which targets to focus on for medical and public health intervention. However, it has mainly been used for disease prevention rather than treatment. To address this, we will apply our new MR methods to genetic datasets to identify potential treatment targets.</p>
<p>The use of MR has also mainly focused on white European populations. We will work with our large population-based study collaborators, including Global Biobank Meta-analysis Initiative and Born in Bradford, to address this. This will allow us to predict ancestry-specific effects for existing and new drugs, and to prioritise interventions for a range of ethnic groups.</p>
<p>We are working with our other themes, including mental health and diet and physical activity, to apply our MR approaches in their research.</p>
<h6 class="anchor anchorTargetStickyNavbar_Vzrq" id="omics-for-prediction-and-prognosis">Omics for prediction and prognosis<a href="https://dmer-group.example.com/news/nihr_brc_2022award#omics-for-prediction-and-prognosis" class="hash-link" aria-label="Direct link to Omics for prediction and prognosis" title="Direct link to Omics for prediction and prognosis" translate="no">​</a></h6>
<p><a href="https://www.bristolbrc.nihr.ac.uk/research/translational-data-science/omics-for-prediction-and-prognosis/" target="_blank" rel="noopener noreferrer" class="">The second workstream will use omics for prediction and prognosis</a>, building on some of Caroline Relton’s work in the MRC IEU on molecular epidemiology, and in particular the use of epigenetics for prediction. The workstream will be co-led by Paul Yousefi, Matthew Suderman and Caroline Relton.</p>
<p>In this workstream we use large, complex molecular (‘omics’) datasets to identify biomarkers to predict who will get a disease and how it will progress.</p>
<p>We use machine learning to identify, optimise and validate these molecular biomarkers. We then combine them with data from health records, cohort studies and trials to develop disease prediction tools for use in a range of settings.</p>
<p>Our biomarker identification work will support other NIHR Bristol BRC themes, including respiratory and mental health.</p>]]></content>
        <author>
            <name>Tom Gaunt</name>
        </author>
        <category label="drug-targets" term="drug-targets"/>
        <category label="mr" term="mr"/>
        <category label="funding" term="funding"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Triangulating evidence in health sciences with Annotated Semantic Queries]]></title>
        <id>https://dmer-group.example.com/news/asq_preprint</id>
        <link href="https://dmer-group.example.com/news/asq_preprint"/>
        <updated>2022-04-16T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Integrating information from data sources representing different study designs has the potential to strengthen evidence in population health research. In this work, led by Yi Liu, we present ASQ (Annotated Semantic Queries), a natural language query interface to EpiGraphDB, which enables users to annotate “claims” from a piece of unstructured text with evidence relevant to the claim. ]]></summary>
        <content type="html"><![CDATA[<p>Update: The ASQ work has now been published in <a href="https://doi.org/10.1093/bioinformatics/btae519" target="_blank" rel="noopener noreferrer" class="">Bioinformatics</a>.</p>
<blockquote>
<p>Yi Liu, Tom R Gaunt, Triangulating evidence in health sciences with Annotated Semantic Queries, Bioinformatics, Volume 40, Issue 9, September 2024, btae519, <a href="https://doi.org/10.1093/bioinformatics/btae519" target="_blank" rel="noopener noreferrer" class="">https://doi.org/10.1093/bioinformatics/btae519</a></p>
</blockquote>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="overview">Overview<a href="https://dmer-group.example.com/news/asq_preprint#overview" class="hash-link" aria-label="Direct link to Overview" title="Direct link to Overview" translate="no">​</a></h2>
<a href="https://asq.epigraphdb.org/"><img src="https://dmer-group.example.com/img/ASQ_logo.png"></a>
<p>Integrating information from data sources representing different study designs has the potential to strengthen evidence in population health research. However, this concept of evidence “triangulation” presents a number of challenges for systematically identifying and integrating relevant information.</p>
<p><a href="https://www.medrxiv.org/content/10.1101/2022.04.12.22273803v1" target="_blank" rel="noopener noreferrer" class="">In this medRxiv preprint</a> we present ASQ (Annotated Semantic Queries), a natural language query interface to the integrated biomedical entities and epidemiological evidence in <a href="https://epigraphdb.org/" target="_blank" rel="noopener noreferrer" class="">EpiGraphDB</a>. ASQ enables users to extract “claims” from a piece of unstructured text, and then investigate the evidence that could support or contradict the claims, or offer additional information relevant to the query.</p>
<p>The ASQ approach has the potential to support the rapid review of pre-prints, grant applications, conference abstracts and articles submitted for peer review. ASQ implements strategies to harmonize biomedical entities in different taxonomies and evidence from different sources, to facilitate evidence triangulation and interpretation.</p>
<p>ASQ is openly available at <a href="https://asq.epigraphdb.org/" target="_blank" rel="noopener noreferrer" class="">https://asq.epigraphdb.org</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-we-did">What we did<a href="https://dmer-group.example.com/news/asq_preprint#what-we-did" class="hash-link" aria-label="Direct link to What we did" title="Direct link to What we did" translate="no">​</a></h2>
<p>The ASQ platform was designed as a natural language interface to the <a href="https://epigraphdb.org/" target="_blank" rel="noopener noreferrer" class="">EpiGraphDB</a> biomedical knowledge graph. ASQ considers two primary <em>evidence groups</em> in EpiGraphDB:</p>
<ul>
<li class=""><em>Triple and literature</em> evidence, comprising semantic triples derived from the biomedical literature</li>
<li class=""><em>Association</em> evidence, comprising results from genetic correlation, polygenic risk score association and Mendelian randomization</li>
</ul>
<p>The user interface accepts free text entry (e.g.&nbsp;the abstract of a pre-print or journal article, the summary of a grant application, etc.). We then use SemRep [1] to derive “claim triples” (Subject-PREDICATE-Object). The user then selects a triple of interest for analysis.</p>
<p>ASQ maps the Subject and Object of the claim triple to biomedical entities in EpiGraphDB. The system then retrieves evidence linking the Subject and Object from the two evidence groups above.</p>
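<p>To make the mapping step more concrete, here is a toy sketch of a claim triple and of matching a query term to the closest known entity by cosine similarity between vectors. It is not the ASQ implementation: the class <code>ClaimTriple</code>, the function <code>nearest_entity</code>, and the two-dimensional vectors are all hypothetical, and a real system would use vectors produced by a text encoder.</p>

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ClaimTriple:
    """A Subject-PREDICATE-Object claim parsed from free text."""
    subject: str
    predicate: str
    obj: str


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_entity(
    query_vec: List[float], entity_vecs: Dict[str, List[float]]
) -> Tuple[str, float]:
    """Return the entity whose vector is most similar to the query vector."""
    scored = ((name, cosine(query_vec, vec)) for name, vec in entity_vecs.items())
    return max(scored, key=lambda pair: pair[1])


# Toy usage with made-up vectors, for illustration only.
claim = ClaimTriple(subject="obesity", predicate="CAUSES", obj="heart disease")
entities = {"Body mass index": [1.0, 0.0], "Coronary heart disease": [0.0, 1.0]}
best_name, best_score = nearest_entity([0.9, 0.1], entities)
```

<p>In this toy setup the subject and object would each be matched separately, and the matched pair then used to look up evidence linking the two entities.</p>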
<p>Figure 1 provides an overview of the architecture of the platform.</p>
<p><img decoding="async" loading="lazy" alt="Figure showing the system design of the EpiGraphDB-ASQ platform" src="https://dmer-group.example.com/assets/images/asq_architecture-f4433639732256a17191091ef74b2113.png" width="3102" height="1858" class="img_ev3q"> <em><strong>Figure 1: Overall architecture of the EpiGraphDB-ASQ platform.</strong> Overall architecture design of the EpiGraphDB-ASQ platform and its associated components in the EpiGraphDB ecosystem. Left: EpiGraphDB’s biomedical entities (in the form of graph nodes) from different taxonomies are encoded into vector representations, which allows for fast information retrieval against the query of interest. Epidemiological evidence (in the form of graph edges) is incorporated into ASQ as harmonized evidence groups. Right: Internal processing workflow of the EpiGraphDB-ASQ platform in three stages: the claim parsing stage, the entity harmonization stage, and the evidence retrieval stage.</em></p>
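<p>The three stages in Figure 1 can be illustrated with a toy sketch. This is not the ASQ implementation: the entity names, the three-dimensional vectors, the evidence record and every function name below are hypothetical stand-ins, and cosine similarity is assumed here as one plausible way to compare vector representations.</p>

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ClaimTriple:
    """A Subject-PREDICATE-Object claim, as produced by the claim parsing stage."""
    subject: str
    predicate: str
    object: str


# Hypothetical toy vectors standing in for encoded knowledge graph entities.
TOY_ENTITY_VECTORS = {
    "body mass index": (0.9, 0.1, 0.0),
    "coronary heart disease": (0.1, 0.9, 0.1),
    "type 2 diabetes": (0.0, 0.2, 0.9),
}

# Hypothetical toy vectors for the terms found in the claim text.
TOY_QUERY_VECTORS = {
    "Obesity": (0.8, 0.2, 0.1),
    "Heart Diseases": (0.2, 0.8, 0.0),
}

# Hypothetical evidence keyed by (subject entity, object entity).
TOY_EVIDENCE = {
    ("body mass index", "coronary heart disease"): [
        "association: Mendelian randomization (toy record)"
    ],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def harmonize(term):
    """Entity harmonization: pick the graph entity closest to the claim term."""
    query = TOY_QUERY_VECTORS[term]
    return max(
        TOY_ENTITY_VECTORS,
        key=lambda name: cosine(query, TOY_ENTITY_VECTORS[name]),
    )


def analyse_claim(claim):
    """Run harmonization, then evidence retrieval, for one claim triple."""
    subject_entity = harmonize(claim.subject)
    object_entity = harmonize(claim.object)
    evidence = TOY_EVIDENCE.get((subject_entity, object_entity), [])
    return subject_entity, object_entity, evidence


claim = ClaimTriple("Obesity", "CAUSES", "Heart Diseases")
print(analyse_claim(claim))
```

<p>In this sketch an empty evidence list simply means that no toy record links the two harmonized entities. It says nothing about how the real platform ranks, filters or presents evidence.</p>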
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="paper">Paper<a href="https://dmer-group.example.com/news/asq_preprint#paper" class="hash-link" aria-label="Direct link to Paper" title="Direct link to Paper" translate="no">​</a></h2>
<p>‘Triangulating evidence in health sciences with Annotated Semantic Queries’ by Yi Liu and Tom Gaunt in <a href="https://www.medrxiv.org/content/10.1101/2022.04.12.22273803v1" target="_blank" rel="noopener noreferrer" class="">medRxiv</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="code-availability">Code availability<a href="https://dmer-group.example.com/news/asq_preprint#code-availability" class="hash-link" aria-label="Direct link to Code availability" title="Direct link to Code availability" translate="no">​</a></h2>
<p>Source code for the ASQ platform and relevant analysis scripts can be found at <a href="https://github.com/mrcieu/epigraphdb-asq" target="_blank" rel="noopener noreferrer" class="">https://github.com/mrcieu/epigraphdb-asq</a>. A tutorial on programmatically accessing the ASQ platform is available in this Jupyter notebook: <a href="https://github.com/MRCIEU/epigraphdb-asq/blob/master/analysis/notebooks/programmatic-access.ipynb" target="_blank" rel="noopener noreferrer" class="">https://github.com/MRCIEU/epigraphdb-asq/blob/master/analysis/notebooks/programmatic-access.ipynb</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="references">References<a href="https://dmer-group.example.com/news/asq_preprint#references" class="hash-link" aria-label="Direct link to References" title="Direct link to References" translate="no">​</a></h2>
<p>[1] Kilicoglu, H., Rosemblat, G., Fiszman, M. &amp; Shin, D. Broad-coverage biomedical relation extraction with SemRep. BMC bioinformatics 21, 1–28 (2020).</p>]]></content>
        <author>
            <name>Tom Gaunt</name>
        </author>
        <category label="database" term="database"/>
        <category label="epigraphdb" term="epigraphdb"/>
        <category label="MR" term="MR"/>
        <category label="NLP" term="NLP"/>
        <category label="software" term="software"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Systematic comparison of Mendelian randomization studies and randomized controlled trials using electronic databases]]></title>
        <id>https://dmer-group.example.com/news/mr-rct_preprint</id>
        <link href="https://dmer-group.example.com/news/mr-rct_preprint"/>
        <updated>2022-04-16T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Triangulating results between Mendelian randomization studies and randomized controlled trials has the potential to strengthen evidence for an intervention target. In this work, led by Maria Sobczyk, we mined ClinicalTrials.gov, PubMed and EpiGraphDB databases and carried out a series of 26 manual literature comparisons among 54 MR and 77 RCT publications to explore the potential for systematic triangulation.]]></summary>
        <content type="html"><![CDATA[<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="overview">Overview<a href="https://dmer-group.example.com/news/mr-rct_preprint#overview" class="hash-link" aria-label="Direct link to Overview" title="Direct link to Overview" translate="no">​</a></h2>
<p>Mendelian Randomization (MR) uses genetic instrumental variables to make causal inferences. Whilst sometimes referred to as “nature’s randomized trial”, it has distinct assumptions that make comparisons between the results of MR studies with those of actual randomized controlled trials (RCTs) invaluable.</p>
<p><a href="https://www.medrxiv.org/content/10.1101/2022.04.11.22273633v1" target="_blank" rel="noopener noreferrer" class="">In this medRxiv pre-print</a> we mined ClinicalTrials.gov, PubMed and EpiGraphDB databases and carried out a series of 26 manual literature comparisons among 54 MR and 77 RCT publications.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-we-did">What we did<a href="https://dmer-group.example.com/news/mr-rct_preprint#what-we-did" class="hash-link" aria-label="Direct link to What we did" title="Direct link to What we did" translate="no">​</a></h2>
<img src="https://dmer-group.example.com/img/mr-rct_medRxiv.png">
<p>We downloaded all results for all trials within <a href="https://clinicaltrials.gov/" target="_blank" rel="noopener noreferrer" class="">ClinicalTrials.gov</a>, filtering them as illustrated in the figure. We used <a href="https://epigraphdb.org/" target="_blank" rel="noopener noreferrer" class="">EpiGraphDB</a> to collect information about drug-target associations and semantic triples associated with selected MR and RCT publications based on a comprehensive search of PubMed. We then mapped MR studies to the corresponding RCTs to evaluate consistency and disagreement between the results.</p>
<p>We found that only 11% of completed RCTs identified in ClinicalTrials.gov submitted their results to the database. Similarly low coverage was revealed for Semantic Medline (SemMedDB) semantic triples derived from MR and RCT publications – 25% and 12%, respectively.</p>
<p>Among intervention types that can be mimicked by MR, only trials of pharmaceutical interventions could be automatically matched to MR results due to insufficient annotation with MeSH ontology. A manual survey of the literature highlighted the potential for triangulation across a number of exposure/outcome pairs if similar challenges can be addressed. We conclude that careful triangulation of MR with RCT evidence should involve consideration of similarity of phenotypes across study designs, intervention intensity and duration, study population demography and health status, comparator group, intervention goal and quality of evidence.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="paper">Paper<a href="https://dmer-group.example.com/news/mr-rct_preprint#paper" class="hash-link" aria-label="Direct link to Paper" title="Direct link to Paper" translate="no">​</a></h2>
<p>‘Systematic comparison of Mendelian randomization studies and randomized controlled trials using electronic databases’ by Maria K. Sobczyk, George Davey Smith and Tom R. Gaunt in <a href="https://www.medrxiv.org/content/10.1101/2022.04.11.22273633v1" target="_blank" rel="noopener noreferrer" class="">medRxiv</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="code-and-data-availability">Code and data availability<a href="https://dmer-group.example.com/news/mr-rct_preprint#code-and-data-availability" class="hash-link" aria-label="Direct link to Code and data availability" title="Direct link to Code and data availability" translate="no">​</a></h2>
<p>Code used to carry out the analysis is available on GitHub: <a href="https://github.com/marynias/mr-rct" target="_blank" rel="noopener noreferrer" class="">https://github.com/marynias/mr-rct</a>. ClinicalTrials.gov data was accessed via AACT: <a href="https://aact.ctti-clinicaltrials.org/snapshots" target="_blank" rel="noopener noreferrer" class="">https://aact.ctti-clinicaltrials.org/snapshots</a> and the analysed data subset is available in Supplementary Datasets 1 &amp; 2. pQTL and eQTL MR analysis results are available via EpiGraphDB: <a href="https://epigraphdb.org/xqtl" target="_blank" rel="noopener noreferrer" class="">https://epigraphdb.org/xqtl</a>.</p>]]></content>
        <author>
            <name>Tom Gaunt</name>
        </author>
        <category label="database" term="database"/>
        <category label="epigraphdb" term="epigraphdb"/>
        <category label="MR" term="MR"/>
        <category label="NLP" term="NLP"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Evaluating the potential benefits and pitfalls of combining protein and expression quantitative trait loci in evidencing drug targets]]></title>
        <id>https://dmer-group.example.com/news/eQTLpQTL_comparison_preprint</id>
        <link href="https://dmer-group.example.com/news/eQTLpQTL_comparison_preprint"/>
        <updated>2022-03-17T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Molecular quantitative trait loci (molQTL), which can provide functional evidence on the mechanisms underlying phenotype-genotype associations, are increasingly used in drug target validation and safety assessment. In this work, led by Jamie Robinson, we evaluate the differences between expression and protein QTL and explore the possible reasons for apparent contradictory effects of genetic variants.]]></summary>
        <content type="html"><![CDATA[<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="overview">Overview<a href="https://dmer-group.example.com/news/eQTLpQTL_comparison_preprint#overview" class="hash-link" aria-label="Direct link to Overview" title="Direct link to Overview" translate="no">​</a></h2>
<p>Molecular quantitative trait loci (molQTL), which can provide functional evidence on the mechanisms underlying phenotype-genotype associations, are increasingly used in drug target validation and safety assessment. In particular, protein abundance QTLs (pQTLs) and gene expression QTLs (eQTLs) are the most commonly used for this purpose. However, questions remain on how to best consolidate results from pQTLs and eQTLs for target validation.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-we-did">What we did<a href="https://dmer-group.example.com/news/eQTLpQTL_comparison_preprint#what-we-did" class="hash-link" aria-label="Direct link to What we did" title="Direct link to What we did" translate="no">​</a></h2>
<p><a href="https://www.biorxiv.org/content/10.1101/2022.03.15.484248v1" target="_blank" rel="noopener noreferrer" class="">In this bioRxiv pre-print</a> we combined blood cell-derived eQTLs and plasma-derived pQTLs to form QTL pairs representing each gene and its product. We performed a series of enrichment analyses to identify features of QTL pairs that provide consistent evidence for drug targets based on the concordance of the direction of effect of the pQTL and eQTL. We repeated these analyses using eQTLs derived in 49 tissues.</p>
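<p>As a minimal sketch of what “concordance of the direction of effect” means for a QTL pair, the snippet below labels a pair as concordant when the eQTL and pQTL effect estimates share a sign. It assumes both effects are non-zero and already aligned to the same effect allele. The class, the function names and the four toy pairs are hypothetical and are not taken from the pre-print.</p>

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class QtlPair:
    """One gene with an eQTL and a pQTL effect on the same effect allele."""
    gene: str
    eqtl_beta: float  # effect of the allele on gene expression
    pqtl_beta: float  # effect of the same allele on protein abundance


def is_concordant(pair: QtlPair) -> bool:
    """True when both effects point in the same direction (same sign)."""
    return pair.eqtl_beta * pair.pqtl_beta > 0


def discordant_fraction(pairs: Sequence[QtlPair]) -> float:
    """Share of QTL pairs whose eQTL and pQTL effects disagree in sign."""
    if not pairs:
        raise ValueError("at least one QTL pair is required")
    discordant = sum(1 for pair in pairs if not is_concordant(pair))
    return discordant / len(pairs)


# Hypothetical toy data, for illustration only.
TOY_PAIRS = [
    QtlPair("GENE_A", 0.30, 0.12),
    QtlPair("GENE_B", -0.25, -0.40),
    QtlPair("GENE_C", 0.10, -0.05),  # opposite signs: discordant
    QtlPair("GENE_D", -0.08, -0.02),
]

print(discordant_fraction(TOY_PAIRS))
```

<p>A sign comparison like this ignores the uncertainty in each estimate, so any real analysis would need additional handling of effect sizes close to zero.</p>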
<p>We found that 25-30% of blood-cell derived QTL pairs have discordant effects. The difference in tissues of origin for molecular markers contributes to, but is not likely a major source of, this observed discordance. Finally, druggable genes were as likely to have discordant QTL pairs as concordant.</p>
<p>Our analyses suggest combining and consolidating evidence from pQTLs and eQTLs for drug target validation is crucial and should be done whenever possible, as many potential drug targets show discordance between the two molecular phenotypes that could be misleading if only one is considered. We also encourage investigating QTL tissue-specificity in target validation applications to help identify reasons for discordance and emphasise that concordance and discordance of QTL pairs across tissues are both informative in target validation.</p>
<p><img decoding="async" loading="lazy" alt="Enrichment results for the Drug-Gene Interaction Database" src="https://dmer-group.example.com/assets/images/pqtl_qtl_bioRxiv-63adb7778638ab3770432cea99b7e914.png" width="3552" height="2002" class="img_ev3q"> <em><strong>Figure 1: Results for the enrichment analysis using the Drug-Gene Interaction database (DGIdb) for druggability-related terms.</strong> The bars show the percentage of concordant or discordant QTL pairs which were enriched for a given term. P values were calculated using Fisher’s exact test, and unadjusted and FDR-adjusted P values are shown for those terms which reached at least nominal significance.</em></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="paper">Paper<a href="https://dmer-group.example.com/news/eQTLpQTL_comparison_preprint#paper" class="hash-link" aria-label="Direct link to Paper" title="Direct link to Paper" translate="no">​</a></h2>
<p>‘Evaluating the potential benefits and pitfalls of combining protein and expression quantitative trait loci in evidencing drug targets’ by Jamie W Robinson, Thomas Battram, Denis A Baird, Philip C Haycock, Jie Zheng, Gibran Hemani, Chia-Yen Chen and Tom R Gaunt in <a href="https://www.biorxiv.org/content/10.1101/2022.03.15.484248v1" target="_blank" rel="noopener noreferrer" class="">bioRxiv</a>.</p>]]></content>
        <author>
            <name>Tom Gaunt</name>
        </author>
        <category label="database" term="database"/>
        <category label="epigraphdb" term="epigraphdb"/>
        <category label="MR" term="MR"/>
        <category label="NLP" term="NLP"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Senior Research Associate / Research Fellow in Health Data Science]]></title>
        <id>https://dmer-group.example.com/news/job-sra-rf-health-data-science-2021</id>
        <link href="https://dmer-group.example.com/news/job-sra-rf-health-data-science-2021"/>
        <updated>2022-01-12T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[We are seeking a talented postdoctoral scientist with expertise in biomedical data integration and analysis, data mining and causal inference]]></summary>
        <content type="html"><![CDATA[<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-role">The role:<a href="https://dmer-group.example.com/news/job-sra-rf-health-data-science-2021#the-role" class="hash-link" aria-label="Direct link to The role:" title="Direct link to The role:" translate="no">​</a></h4>
<p>We are seeking a talented postdoctoral scientist with expertise in biomedical data integration and analysis, data mining and causal inference. As the successful candidate you will join a vibrant interdisciplinary research environment in the <a href="https://www.bris.ac.uk/ieu" target="_blank" rel="noopener noreferrer" class="">MRC Integrative Epidemiology Unit</a>, working within a programme that applies data mining approaches to epidemiological research questions (<a href="http://www.biocompute.org.uk/" target="_blank" rel="noopener noreferrer" class="">www.biocompute.org.uk</a>). The post holder will be appointed at either Senior Research Associate (grade J) or Research Fellow (grade K) depending on their level of experience. If successful, you will have the opportunity to develop your own research portfolio within the programme, contribute to teaching and postgraduate training and will be supported in your career progression. <strong>Closing date: 6th Feb 2022</strong></p>
<a class="btn btn-success" href="https://www.bristol.ac.uk/jobs/find/details/?jobId=264874&amp;jobTitle=Senior%20Research%20Associate%20%2F%20Research%20Fellow%20in%20Health%20Data%20Science">Online applications</a>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-will-you-be-doing">What will you be doing?<a href="https://dmer-group.example.com/news/job-sra-rf-health-data-science-2021#what-will-you-be-doing" class="hash-link" aria-label="Direct link to What will you be doing?" title="Direct link to What will you be doing?" translate="no">​</a></h4>
<p>The role will be focused on systematic analysis of knowledge graphs such as EpiGraphDB (<a href="https://epigraphdb.org/" target="_blank" rel="noopener noreferrer" class="">epigraphdb.org</a>) to identify risk factors, mechanistic pathways and potential intervention targets for human disease. You will be encouraged to co-develop new project ideas that align with the research objectives of the MRC IEU, in particular those focused on data mining, knowledge extraction from the literature, evidence triangulation and knowledge graph analysis.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="you-should-apply-if">You should apply if:<a href="https://dmer-group.example.com/news/job-sra-rf-health-data-science-2021#you-should-apply-if" class="hash-link" aria-label="Direct link to You should apply if:" title="Direct link to You should apply if:" translate="no">​</a></h4>
<ul>
<li class="">You have a strong computational and analytical background, with experience in programming, data science and the application of these skills in health, biomedical or biological research.</li>
<li class="">Ideally you will be familiar with knowledge graphs and natural language processing.</li>
<li class="">You will also have experience of applied statistical analysis and/or machine learning and will have a good understanding of causal inference using graphical causal models.</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="further-information-and-how-to-apply">Further information and how to apply<a href="https://dmer-group.example.com/news/job-sra-rf-health-data-science-2021#further-information-and-how-to-apply" class="hash-link" aria-label="Direct link to Further information and how to apply" title="Direct link to Further information and how to apply" translate="no">​</a></h4>
<p>Please read the <a href="https://emea3.recruitmentplatform.com/tlk/pages/fo/download_job_file.jsp?ID=Q50FK026203F3VBQBV7V77V83&amp;nDocumentID=2804301&amp;ptId=259009" target="_blank" rel="noopener noreferrer" class="">job description</a> for further information on the role and criteria. We ask you to provide evidence for how you meet these criteria in your application. If you would like more information please don’t hesitate to get in touch: Prof Tom Gaunt (<a href="mailto:tom.gaunt@bristol.ac.uk">Email</a>)</p>
<p>You should apply using the online application system, which is linked through the green button below. (Applications that are not submitted to the online system by the end of 31st October 2021 cannot be considered.)</p>
<a class="btn btn-success" href="https://www.bristol.ac.uk/jobs/find/details/?jobId=259009&amp;jobTitle=Senior%20Research%20Associate%20%2F%20Research%20Fellow%20in%20Health%20Data%20Science">Online applications</a>]]></content>
        <author>
            <name>Tom Gaunt</name>
        </author>
        <category label="jobs" term="jobs"/>
        <category label="health data science" term="health data science"/>
        <category label="work with us" term="work with us"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Trans-ethnic Mendelian-randomization study reveals causal relationships between cardiometabolic factors and chronic kidney disease]]></title>
        <id>https://dmer-group.example.com/news/trans-ethnic-ckd-ije</id>
        <link href="https://dmer-group.example.com/news/trans-ethnic-ckd-ije"/>
        <updated>2021-10-20T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[This paper, led by Jie Zheng, systematically analysed previously reported risk factors for chronic kidney disease in European and East Asian populations using Mendelian randomization. The analysis showed evidence of both cross-population and population-specific risk factors.]]></summary>
        <content type="html"><![CDATA[<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="overview">Overview<a href="https://dmer-group.example.com/news/trans-ethnic-ckd-ije#overview" class="hash-link" aria-label="Direct link to Overview" title="Direct link to Overview" translate="no">​</a></h2>
<p>Most Mendelian randomization (MR) studies focus on European populations because of the wealth of genome-wide association study (GWAS) datasets available from European ancestry population samples, in contrast to other populations. However, new GWAS summary datasets from studies such as Biobank Japan, China Kadoorie Biobank and the Japan Kidney Biobank enable us to run ancestry-specific MR analyses to compare causal effects of risk factors across populations. This approach is important in the use of MR to inform public health priorities and interventions in other populations and sub-populations that have historically been under-represented in research.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-we-did">What we did<a href="https://dmer-group.example.com/news/trans-ethnic-ckd-ije#what-we-did" class="hash-link" aria-label="Direct link to What we did" title="Direct link to What we did" translate="no">​</a></h2>
<p><a href="https://doi.org/10.1093/ije/dyab203" target="_blank" rel="noopener noreferrer" class="">In this paper in International Journal of Epidemiology</a> we used MR to estimate the causal effects of 45 risk factors in Europeans and 17 risk factors in East Asians (based on available genetic data) on chronic kidney disease (CKD) (Figure 1). CKD was defined by either clinical diagnosis or an estimated glomerular filtration rate (eGFR) of less than 60 ml/min per 1.73 m<sup>2</sup>.</p>
<p>In collaboration with an international team of researchers representing a number of different population studies we were able to analyse samples from:</p>
<ul>
<li class="">51,672 CKD cases and 958,102 controls of European ancestry: CKDGen, UK Biobank and HUNT</li>
<li class="">13,093 CKD cases and 238,118 controls of East Asian ancestry: Biobank Japan, China Kadoorie Biobank and Japan-Kidney-Biobank/ToMMo</li>
</ul>
<p>In European ancestry samples we found evidence of causality for: body mass index (BMI), hypertension, systolic blood pressure, high-density lipoprotein cholesterol, apolipoprotein A-I, lipoprotein(a), type 2 diabetes (T2D) and nephrolithiasis.</p>
<p>In East Asians we were only able to analyse a subset of risk factors (due to availability of exposure genetic data for MR), but found evidence of causal effects for: BMI, T2D and nephrolithiasis.</p>
<p>Interestingly, in two independent analyses we observed evidence of a causal effect of hypertension on CKD risk in Europeans, but little evidence of an effect in East Asians.</p>
<p><img decoding="async" loading="lazy" alt="Study design" src="https://dmer-group.example.com/assets/images/ckd-trans-ethnic-ije-f1-5975e1c43bec92b9be73d579dbd3cea7.jpeg" width="2000" height="2305" class="img_ev3q"> <em><strong>Figure 1: Study design for the trans-ethnic MR study of CKD.</strong></em></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="paper">Paper<a href="https://dmer-group.example.com/news/trans-ethnic-ckd-ije#paper" class="hash-link" aria-label="Direct link to Paper" title="Direct link to Paper" translate="no">​</a></h2>
<p>‘Trans-ethnic Mendelian-randomization study reveals causal relationships between cardiometabolic factors and chronic kidney disease’ by Jie Zheng, Yuemiao Zhang, Humaira Rasheed, Venexia Walker, Yuka Sugawara, Jiachen Li, Yue Leng, Benjamin Elsworth, Robyn E Wootton, Si Fang, Qian Yang, Stephen Burgess, Philip C Haycock, Maria Carolina Borges, Yoonsu Cho, Rebecca Carnegie, Amy Howell, Jamie Robinson, Laurent F Thomas, Ben Michael Brumpton, Kristian Hveem, Stein Hallan, Nora Franceschini, Andrew P Morris, Anna Köttgen, Cristian Pattaro, Matthias Wuttke, Masayuki Yamamoto, Naoki Kashihara, Masato Akiyama, Masahiro Kanai, Koichi Matsuda, Yoichiro Kamatani, Yukinori Okada, Robin Walters, Iona Y Millwood, Zhengming Chen, George Davey Smith, Sean Barbour, Canqing Yu, Bjorn Olav Asvold, Hong Zhang and Tom R Gaunt in <a href="https://doi.org/10.1093/ije/dyab203" target="_blank" rel="noopener noreferrer" class="">International Journal of Epidemiology (2021)</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="data-availability">Data availability<a href="https://dmer-group.example.com/news/trans-ethnic-ckd-ije#data-availability" class="hash-link" aria-label="Direct link to Data availability" title="Direct link to Data availability" translate="no">​</a></h2>
<p>The GWAS summary statistics for CKD and eGFR that were generated using UK Biobank and CKDGen data are available from the MRC-IEU OpenGWAS database <a href="https://gwas.mrcieu.ac.uk/" target="_blank" rel="noopener noreferrer" class="">https://gwas.mrcieu.ac.uk/</a> and CKDGen website <a href="http://ckdgen.imbi.uni-freiburg.de/" target="_blank" rel="noopener noreferrer" class="">http://ckdgen.imbi.uni-freiburg.de/</a>, respectively. Other datasets are available on request to the originating study.</p>]]></content>
        <author>
            <name>Tom Gaunt</name>
        </author>
        <category label="MR" term="MR"/>
        <category label="CKD" term="CKD"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[EpiGraphDB platform version 1.0]]></title>
        <id>https://dmer-group.example.com/news/epigraphdb_v1</id>
        <link href="https://dmer-group.example.com/news/epigraphdb_v1"/>
        <updated>2021-03-22T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[EpiGraphDB v1.0 and summary of features and changes.]]></summary>
        <content type="html"><![CDATA[<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="epigraphdb-version-10">EpiGraphDB version 1.0<a href="https://dmer-group.example.com/news/epigraphdb_v1#epigraphdb-version-10" class="hash-link" aria-label="Direct link to EpiGraphDB version 1.0" title="Direct link to EpiGraphDB version 1.0" translate="no">​</a></h2>
<p>The EpiGraphDB platform has been updated with a new major release (version 1.0). This is the first release since version 0.3 in 2020 (what a year!) and the first since the publication of the <a href="https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa961/5962087" target="_blank" rel="noopener noreferrer" class="">journal article in Bioinformatics</a>. We believe the underlying integration pipeline, data structure and architecture for the EpiGraphDB platform have now progressed to a sufficiently stable state that we are pleased to announce this major release as version 1.0!</p>
<figure><img src="https://dmer-group.example.com/img/posts/epigraphdb-v1.png" alt="EpiGraphDB version 1"><figcaption aria-hidden="true">EpiGraphDB version 1</figcaption></figure>
<p>In the following sections we highlight a few key new features and changes in this update. For more detailed and technical changes, please visit the <a href="https://docs.epigraphdb.org/CHANGELOG-v1/" target="_blank" rel="noopener noreferrer" class="">changelog</a> in the platform documentation.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="new-and-overhauled-data-sources">New and overhauled data sources<a href="https://dmer-group.example.com/news/epigraphdb_v1#new-and-overhauled-data-sources" class="hash-link" aria-label="Direct link to New and overhauled data sources" title="Direct link to New and overhauled data sources" translate="no">​</a></h4>
<h6 class="anchor anchorTargetStickyNavbar_Vzrq" id="clinvar">ClinVar<a href="https://dmer-group.example.com/news/epigraphdb_v1#clinvar" class="hash-link" aria-label="Direct link to ClinVar" title="Direct link to ClinVar" translate="no">​</a></h6>
<p>ClinVar is a public archive of reports of genetic variants and interpretations of their clinical relevance to disease. The variants are submitted by clinical testing laboratories, research laboratories, expert panels and other groups.</p>
<p>We import ClinVar data (extracted on 2021-01-12) as <a href="https://ftp.ncbi.nlm.nih.gov/pub/clinvar/gene_condition_source_id" target="_blank" rel="noopener noreferrer" class="">gene-disease associations</a>, available as <a href="https://docs.epigraphdb.org/graph-database/meta-relationships/##gene_to_disease" target="_blank" rel="noopener noreferrer" class=""><code>[GENE_TO_DISEASE]</code></a> relationship in EpiGraphDB. The sources of information for the gene-disease relationship include OMIM, GeneReviews, and a limited amount of curation by NCBI staff.</p>
<h6 class="anchor anchorTargetStickyNavbar_Vzrq" id="mapping-between-ebi-gwas-catalog-gwas-traits-and-efo-terms">Mapping between EBI GWAS Catalog GWAS traits and EFO terms<a href="https://dmer-group.example.com/news/epigraphdb_v1#mapping-between-ebi-gwas-catalog-gwas-traits-and-efo-terms" class="hash-link" aria-label="Direct link to Mapping between EBI GWAS Catalog GWAS traits and EFO terms" title="Direct link to Mapping between EBI GWAS Catalog GWAS traits and EFO terms" translate="no">​</a></h6>
<p>To complement existing semantic mapping between <a href="http://docs.epigraphdb.org/graph-database/meta-nodes/##gwas" target="_blank" rel="noopener noreferrer" class=""><code>(Gwas)</code></a> traits and <a href="http://docs.epigraphdb.org/graph-database/meta-nodes/##efo" target="_blank" rel="noopener noreferrer" class=""><code>(Efo)</code></a> ontology terms (<a href="http://docs.epigraphdb.org/graph-database/meta-relationships/##gwas_nlp_efo" target="_blank" rel="noopener noreferrer" class=""><code>GWAS_NLP_EFO</code></a>) we have added the official mapping from EBI GWAS Catalog (available as “ebi-a” <a href="https://gwas.mrcieu.ac.uk/datasets/" target="_blank" rel="noopener noreferrer" class="">studies</a> in OpenGWAS) and EFO terms. Such mapping is available as <a href="https://docs.epigraphdb.org/graph-database/meta-relationships/##gwas_efo_ebi" target="_blank" rel="noopener noreferrer" class=""><code>[GWAS_EFO_EBI]</code></a> in EpiGraphDB.</p>
<h6 class="anchor anchorTargetStickyNavbar_Vzrq" id="mr-eve">MR-EvE<a href="https://dmer-group.example.com/news/epigraphdb_v1#mr-eve" class="hash-link" aria-label="Direct link to MR-EvE" title="Direct link to MR-EvE" translate="no">​</a></h6>
<p>We have incorporated the latest <a href="https://www.biorxiv.org/content/10.1101/173682v2" target="_blank" rel="noopener noreferrer" class="">MR-EvE</a> evidence into EpiGraphDB. The MR-EvE evidence is represented as <a href="https://docs.epigraphdb.org/graph-database/meta-relationships/##mr_eve_mr" target="_blank" rel="noopener noreferrer" class=""><code>[MR_EVE_MR]</code></a> in EpiGraphDB. With this update, <code>[MR_EVE_MR]</code> evidence has increased from 583,619 records to 25,804,945 records (for further details visit the <a href="https://docs.epigraphdb.org/graph-database/metrics/" target="_blank" rel="noopener noreferrer" class="">metadata and metrics</a>). For further examples regarding the MR-EvE evidence, take a look at <a href="https://epigraphdb.org/mr" target="_blank" rel="noopener noreferrer" class="">the MR view on the EpiGraphDB WebUI</a> and <a href="https://epigraphdb.org/confounder" target="_blank" rel="noopener noreferrer" class="">the confounder view</a> as well as the underlying API endpoints in the <a href="https://api.epigraphdb.org/" target="_blank" rel="noopener noreferrer" class="">EpiGraphDB API</a>.</p>
<h6 class="anchor anchorTargetStickyNavbar_Vzrq" id="reactome">Reactome<a href="https://dmer-group.example.com/news/epigraphdb_v1#reactome" class="hash-link" aria-label="Direct link to Reactome" title="Direct link to Reactome" translate="no">​</a></h6>
<p>The Reactome data source has been overhauled and simplified. We now make use of the protein and pathway data sets available to download <a href="https://reactome.org/download/current" target="_blank" rel="noopener noreferrer" class="">here</a>.</p>
<h6 class="anchor anchorTargetStickyNavbar_Vzrq" id="literature-derived-evidence">Literature derived evidence<a href="https://dmer-group.example.com/news/epigraphdb_v1#literature-derived-evidence" class="hash-link" aria-label="Direct link to Literature derived evidence" title="Direct link to Literature derived evidence" translate="no">​</a></h6>
<p>In addition to the newer version of SemMedDB (semmedVER42_R) we used <a href="https://semrep.nlm.nih.gov/" target="_blank" rel="noopener noreferrer" class="">SemRep</a> to create semantic triples from medRxiv and bioRxiv titles and abstracts. As a result, we renamed the literature nodes and relationships in the graph, e.g.&nbsp;instead of <code>(SemmedTriple)</code> we now have <a href="http://docs.epigraphdb.org/graph-database/meta-nodes/##literaturetriple" target="_blank" rel="noopener noreferrer" class=""><code>(LiteratureTriple)</code></a>, and instead of <code>(SemmedTerm)</code> we now have <a href="http://docs.epigraphdb.org/graph-database/meta-nodes/##literatureterm" target="_blank" rel="noopener noreferrer" class=""><code>(LiteratureTerm)</code></a>, each with a <code>_source</code> property (see <a href="https://docs.epigraphdb.org/CHANGELOG-v1/" target="_blank" rel="noopener noreferrer" class="">changelog</a>). Relationships between the new nodes are named after the data source, e.g. <a href="http://docs.epigraphdb.org/graph-database/meta-relationships/##semmeddb_obj" target="_blank" rel="noopener noreferrer" class=""><code>[SEMMEDDB_OBJ]</code></a>, <a href="http://docs.epigraphdb.org/graph-database/meta-relationships/##biorxiv_obj" target="_blank" rel="noopener noreferrer" class=""><code>[BIORXIV_OBJ]</code></a> and <a href="http://docs.epigraphdb.org/graph-database/meta-relationships/##medrxiv_obj" target="_blank" rel="noopener noreferrer" class=""><code>[MEDRXIV_OBJ]</code></a> in place of <code>[SEM_OBJ]</code>.</p>
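<p>For anyone updating Cypher queries written against the old literature schema, the renames above can be applied mechanically. The helper below is a hypothetical sketch and not part of any EpiGraphDB client. It only covers the names mentioned in this post, the example query shape is invented for illustration, and expanding <code>SEM_OBJ</code> to all three source-specific relationship types is an assumption about what a migrated query should match.</p>

```python
import re

# Node label renames described in this post (old name -> new name).
NODE_LABEL_RENAMES = {
    "SemmedTriple": "LiteratureTriple",
    "SemmedTerm": "LiteratureTerm",
}

# Source-specific relationship types that replace SEM_OBJ.
NEW_OBJ_RELATIONSHIPS = ("SEMMEDDB_OBJ", "BIORXIV_OBJ", "MEDRXIV_OBJ")


def migrate_query(query: str) -> str:
    """Rewrite old literature names in a Cypher query string to the v1.0 names."""
    for old, new in NODE_LABEL_RENAMES.items():
        query = re.sub(rf"\b{old}\b", new, query)
    # Cypher matches any of several relationship types with TYPE_A|TYPE_B.
    return re.sub(r"\bSEM_OBJ\b", "|".join(NEW_OBJ_RELATIONSHIPS), query)


# Hypothetical pre-v1.0 query, written without a direction on the relationship.
OLD_QUERY = (
    "MATCH (t:SemmedTriple)-[:SEM_OBJ]-(term:SemmedTerm) "
    "RETURN t, term LIMIT 5"
)

print(migrate_query(OLD_QUERY))
```

<p>To restrict a migrated query to a single literature source, keep only the relevant relationship type, or filter on the <code>_source</code> property after checking its values in the changelog.</p>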
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="codebase">Codebase<a href="https://dmer-group.example.com/news/epigraphdb_v1#codebase" class="hash-link" aria-label="Direct link to Codebase" title="Direct link to Codebase" translate="no">​</a></h4>
<p>We have refactored our entire graph build pipeline to improve transparency, reliability and robustness. For this we use the <a href="https://github.com/elswob/neo4j-build-pipeline" target="_blank" rel="noopener noreferrer" class="">neo4j-build-pipeline</a> which uses defined schemas and tests to ensure the graph is consistent and clean. More details on this can be found in the following blog posts <a href="https://www.biocompute.org.uk/post/neo4j_data_integration/" target="_blank" rel="noopener noreferrer" class="">https://www.biocompute.org.uk/post/neo4j_data_integration/</a> and <a href="https://neo4j.com/blog/neo4j-data-integration-pipeline-using-snakemake-and-docker/" target="_blank" rel="noopener noreferrer" class="">https://neo4j.com/blog/neo4j-data-integration-pipeline-using-snakemake-and-docker/</a>.</p>
<p>In addition, the source code for the <a href="https://github.com/MRCIEU/epigraphdb-graph" target="_blank" rel="noopener noreferrer" class="">Graph</a>, <a href="https://github.com/mrcieu/epigraphdb_web" target="_blank" rel="noopener noreferrer" class="">WebUI</a> and <a href="https://github.com/mrcieu/epigraphdb_api" target="_blank" rel="noopener noreferrer" class="">API</a> is now hosted on GitHub under the <a href="https://mrcieu.github.io/" target="_blank" rel="noopener noreferrer" class="">MRCIEU organisation</a>. We plan to write a separate blog post about the technologies behind the EpiGraphDB platform in the near future.</p>
<p>For further information on the software side of the EpiGraphDB project (as well as other software projects developed in the MRC IEU) please visit MRC IEU’s <a href="https://mrcieu.github.io/software/epigraphdb" target="_blank" rel="noopener noreferrer" class="">GitHub Pages</a>.</p>
<hr>
<p>EpiGraphDB can be accessed and interacted with in the following ways:</p>
<ul>
<li class="">The interactive <a href="https://epigraphdb.org/" target="_blank" rel="noopener noreferrer" class="">Web UI</a></li>
<li class="">The <a href="https://api.epigraphdb.org/" target="_blank" rel="noopener noreferrer" class="">API</a></li>
<li class="">Example <a href="https://github.com/mrcieu/epigraphdb" target="_blank" rel="noopener noreferrer" class="">Jupyter notebooks</a></li>
<li class="">The <a href="https://mrcieu.github.io/epigraphdb-r" target="_blank" rel="noopener noreferrer" class="">R package</a></li>
</ul>
<p>For further details on the EpiGraphDB research project please read our journal article published on <a href="https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa961/5962087" target="_blank" rel="noopener noreferrer" class="">Bioinformatics</a> as well as the <a href="https://docs.epigraphdb.org/" target="_blank" rel="noopener noreferrer" class="">platform documentation</a>.</p>
<p>Please get in touch with the team <a href="https://github.com/mrcieu/epigraphdb/issues" target="_blank" rel="noopener noreferrer" class="">via GitHub issues</a>, <a href="https://twitter.com/epigraphdb" target="_blank" rel="noopener noreferrer" class="">on Twitter</a>, or <a href="mailto:feedback@epigraphdb.org" target="_blank" rel="noopener noreferrer" class="">by email</a>.</p>
<p>EpiGraphDB team</p>]]></content>
        <author>
            <name>Yi Liu</name>
        </author>
        <category label="epigraphdb" term="epigraphdb"/>
        <category label="software" term="software"/>
        <category label="database" term="database"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[MendelVar: gene prioritization at GWAS loci using phenotypic enrichment of Mendelian disease genes]]></title>
        <id>https://dmer-group.example.com/news/mendelvar</id>
        <link href="https://dmer-group.example.com/news/mendelvar"/>
        <updated>2021-01-01T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[This paper, led by Maria Sobczyk, presented MendelVar, a tool which integrates knowledge from four databases on Mendelian disease genes with enrichment testing for a range of functional annotations to support the prioritization of genes at GWAS loci.]]></summary>
        <content type="html"><![CDATA[<a href="https://mendelvar.mrcieu.ac.uk/"><img src="https://dmer-group.example.com/img/mendelvar3_logo.svg"></a>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="overview">Overview<a href="https://dmer-group.example.com/news/mendelvar#overview" class="hash-link" aria-label="Direct link to Overview" title="Direct link to Overview" translate="no">​</a></h2>
<p>Gene prioritization at human GWAS loci is challenging due to linkage-disequilibrium and long-range gene regulatory mechanisms. However, identifying the causal gene is crucial to enable identification of potential drug targets and better understanding of molecular mechanisms. Mapping GWAS traits to known phenotypically relevant Mendelian disease genes near a locus is a promising approach to gene prioritization. <a href="https://mendelvar.mrcieu.ac.uk/" target="_blank" rel="noopener noreferrer" class="">MendelVar</a> is a novel web-based platform to support gene prioritization using data from Mendelian disease genes, variants identified in clinical genetics and data from disease ontologies.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-platform">The platform<a href="https://dmer-group.example.com/news/mendelvar#the-platform" class="hash-link" aria-label="Direct link to The platform" title="Direct link to The platform" translate="no">​</a></h2>
<p><a href="https://mendelvar.mrcieu.ac.uk/" target="_blank" rel="noopener noreferrer" class="">MendelVar</a>, presented in <a href="https://doi.org/10.1093/bioinformatics/btaa1096" target="_blank" rel="noopener noreferrer" class="">this Bioinformatics paper</a>, provides a quick overview of the possible impact of Mendelian disease-related genes on a user’s complex phenotype of interest. It returns the details of all known broadly defined Mendelian diseases and their causal genes found in the custom genomic intervals, as well as overlapping pathogenic rare mutations responsible for Mendelian disease. Enrichment of Disease Ontology and Human Phenotype Ontology terms among the Mendelian genes gives the researcher an overview of any shared features with their trait of interest, e.g.&nbsp;in terms of anatomy.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="data-sources">Data sources<a href="https://dmer-group.example.com/news/mendelvar#data-sources" class="hash-link" aria-label="Direct link to Data sources" title="Direct link to Data sources" translate="no">​</a></h4>
<p><a href="https://mendelvar.mrcieu.ac.uk/" target="_blank" rel="noopener noreferrer" class="">MendelVar</a> uses all the confirmed gene-disease relationships featured in <a href="https://omim.org/" target="_blank" rel="noopener noreferrer" class="">OMIM</a> and complements it with three more specialist data sources for Mendelian disease: <a href="https://www.orpha.net/consor/cgi-bin/index.php" target="_blank" rel="noopener noreferrer" class="">Orphanet</a> (a database centred on rare, typically monogenic disease), expertly curated gene panels used for diagnostics from <a href="https://panelapp.genomicsengland.co.uk/" target="_blank" rel="noopener noreferrer" class="">Genomics England PanelApp</a> and results from the on-going Deciphering Developmental Disorders Study (made available in <a href="https://www.deciphergenomics.org/" target="_blank" rel="noopener noreferrer" class="">DECIPHER</a>). <a href="https://mendelvar.mrcieu.ac.uk/" target="_blank" rel="noopener noreferrer" class="">MendelVar</a> includes short disease descriptions sourced from OMIM, Orphanet, <a href="https://www.uniprot.org/" target="_blank" rel="noopener noreferrer" class="">Uniprot</a> and <a href="https://www.ebi.ac.uk/ols/ontologies/doid" target="_blank" rel="noopener noreferrer" class="">DO</a>. In addition, <a href="https://mendelvar.mrcieu.ac.uk/" target="_blank" rel="noopener noreferrer" class="">MendelVar</a> cross-references input genomic intervals against <a href="https://www.ncbi.nlm.nih.gov/clinvar/" target="_blank" rel="noopener noreferrer" class="">ClinVar</a>.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="usage">Usage<a href="https://dmer-group.example.com/news/mendelvar#usage" class="hash-link" aria-label="Direct link to Usage" title="Direct link to Usage" translate="no">​</a></h4>
<p><a href="https://mendelvar.mrcieu.ac.uk/" target="_blank" rel="noopener noreferrer" class="">MendelVar</a> accepts a user-defined list of genomic intervals or a list of top SNPs. Top SNPs can be used to create genomic intervals in two ways in <a href="https://mendelvar.mrcieu.ac.uk/" target="_blank" rel="noopener noreferrer" class="">MendelVar</a>: using pre-set basepair flanks or via generation of individual LD-based boundaries around each top SNP, identified either through dbSNP rsIDs or positional coordinates. The LD-based intervals can also optionally be extended to the nearest recombination hotspot. There is full support for hg19/GRCh37 and hg38/GRCh38 human genome builds for input. Six 1000 Genomes populations (EUR, CEU, AFR, AMR, EAS and SAS) are supported via LDlink for LD-based genomic interval generation.</p>
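<p>The fixed-flank route can be illustrated with a minimal Python sketch. This is not MendelVar's own code: the <code>Interval</code> class, the <code>flank_interval</code> function, the SNP positions and the 50 kb flank sizes are all hypothetical and chosen only for illustration.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    chrom: str
    start: int
    end: int


def flank_interval(chrom, position, left_flank, right_flank):
    """Return a fixed interval around a top SNP, clamped so it never starts before base 1."""
    start = max(1, position - left_flank)
    return Interval(chrom, start, position + right_flank)


# Hypothetical top SNPs given as (chromosome, base-pair position).
top_snps = [("1", 152_300_000), ("11", 20_000)]

# Hypothetical flank sizes of 50 kb on each side.
intervals = [flank_interval(c, p, 50_000, 50_000) for c, p in top_snps]

for interval in intervals:
    print(interval)
```

<p>The LD-based route would replace the fixed flanks with boundaries derived from LD around each SNP, which this sketch does not attempt.</p>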
<p>The web-browser will return a compressed file containing the results of the enrichment and annotation analyses. An <a href="https://www.notion.so/mendelvar/MendelVar-tutorial-ab91d2a6acb846f2b9f2978fcd942dd5" target="_blank" rel="noopener noreferrer" class="">online tutorial</a> provides clear documentation on how to use the platform and how to interpret the results.</p>
<p>See Figure 1 for an illustration of the different user routes through <a href="https://mendelvar.mrcieu.ac.uk/" target="_blank" rel="noopener noreferrer" class="">MendelVar</a>.</p>
<p><img decoding="async" loading="lazy" alt="MendelVar User journey" src="https://dmer-group.example.com/assets/images/mendelvar_userjourney-7ae34f69aab44f96c864caf06ef1df1a.jpeg" width="520" height="1018" class="img_ev3q"> <em><strong>Figure 1: A flowchart demonstrating three possible user routes through MendelVar: (a) left: MendelVar generates fixed genomic intervals using preset left and right flanks against a user-submitted list of genomic positions; (b) centre: MendelVar generates flexible genomic intervals using LD pattern in the region around each user-submitted position/variant rsID; (c) right: MendelVar accepts user-submitted genomic intervals. The genomic intervals generated or obtained from user are subsequently bisected with coordinates for genes and variants known to cause Mendelian disease. Ontology terms associated with Mendelian disease in HPO, DO are propagated to causal genes and are tested for enrichment among target genes in input genomic intervals. MendelVar also provides an option for enrichment testing with Gene Ontology and biological pathway databases.</strong></em></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="paper">Paper<a href="https://dmer-group.example.com/news/mendelvar#paper" class="hash-link" aria-label="Direct link to Paper" title="Direct link to Paper" translate="no">​</a></h2>
<p>‘MendelVar: gene prioritization at GWAS loci using phenotypic enrichment of Mendelian disease genes’ by M K Sobczyk, T R Gaunt and L Paternoster in <a href="https://doi.org/10.1093/bioinformatics/btaa1096" target="_blank" rel="noopener noreferrer" class="">Bioinformatics (2021)</a>.</p>]]></content>
        <author>
            <name>Tom Gaunt</name>
        </author>
        <category label="database" term="database"/>
        <category label="GWAS" term="GWAS"/>
        <category label="software" term="software"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Neo4J data integration pipeline]]></title>
        <id>https://dmer-group.example.com/news/neo4j_data_integration</id>
        <link href="https://dmer-group.example.com/news/neo4j_data_integration"/>
        <updated>2020-11-17T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[We make extensive use of Neo4J for graph databases (including EpiGraphDB). One of the key challenges in constructing a heterogeneous graph database is the data integration from different sources. Ben Elsworth describes the pipeline he has developed to automate this process.]]></summary>
        <content type="html"><![CDATA[<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="background">Background<a href="https://dmer-group.example.com/news/neo4j_data_integration#background" class="hash-link" aria-label="Direct link to Background" title="Direct link to Background" translate="no">​</a></h2>
<p>We’ve been using <a href="https://neo4j.com/" target="_blank" rel="noopener noreferrer" class="">Neo4j</a> for around five years in a variety of projects, sometimes as the main database (<a href="http://melodi.biocompute.org.uk/" target="_blank" rel="noopener noreferrer" class="">MELODI</a>) and sometimes as part of a larger platform (<a href="https://gwas.mrcieu.ac.uk/" target="_blank" rel="noopener noreferrer" class="">OpenGWAS</a>). We find creating queries with Cypher intuitive and query performance to be good. However, the integration of data into a graph is still a challenge, especially when using many data sets from a variety of sources. Our latest project <a href="https://epigraphdb.org/" target="_blank" rel="noopener noreferrer" class="">EpiGraphDB</a> uses data from over 20 independent sources, most of which require cleaning and QC before they can be incorporated. In addition, each build of the graph needs to contain information on the versions of data, the schema of the graph and so on.</p>
<p>Most tutorials and guides focus on post graph analytics, not how the graph was created. Often the process of bringing all the data together is overlooked or assumed to be straightforward. We are keen to provide access and transparency to the entire process and designed this pipeline to help with our projects, but believe this could be of use to others too.</p>
<p>Our data integration pipeline aims to create a working graph from raw data, whilst running checks on each data set and automating the build process. These checks include:</p>
<ul>
<li class="">Data profiling reports with pandas-profiling to help understand any issues with a data set</li>
<li class="">Comparing each node and relationship property against a defined schema</li>
<li class="">Merging overlapping node data into single node files</li>
</ul>
<p>The data are formatted for use with the <a href="https://neo4j.com/docs/operations-manual/current/tutorial/import-tool/" target="_blank" rel="noopener noreferrer" class="">neo4j-import tool</a> as this keeps build time for large graphs reasonable. By creating this pipeline, we can provide complete provenance of a project, from raw data to finished graph.</p>
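<p>As a rough illustration of the schema check and the node merge, here is a minimal Python sketch. It is not the pipeline's actual implementation: <code>GENE_SCHEMA</code>, <code>check_against_schema</code>, <code>merge_nodes</code> and the example records are hypothetical, and the merge rule (the first source to supply a property wins) is an assumption.</p>

```python
# Hypothetical schema: property name mapped to the expected Python type.
GENE_SCHEMA = {"id": str, "name": str, "chr": str}


def check_against_schema(record, schema):
    """Return a list of problems found in one node record."""
    problems = []
    for prop, expected in schema.items():
        if prop not in record:
            problems.append(f"missing property: {prop}")
        elif not isinstance(record[prop], expected):
            problems.append(f"wrong type for {prop}")
    for prop in record:
        if prop not in schema:
            problems.append(f"unexpected property: {prop}")
    return problems


def merge_nodes(*sources):
    """Merge node records from several sources, keyed on 'id'.

    Assumed rule: later sources only fill in properties that are still missing.
    """
    merged = {}
    for source in sources:
        for record in source:
            node = merged.setdefault(record["id"], {})
            for prop, value in record.items():
                node.setdefault(prop, value)
    return list(merged.values())


# Hypothetical records from two overlapping data sources.
source_a = [{"id": "ENSG01", "name": "GENE1"}]
source_b = [
    {"id": "ENSG01", "chr": "1"},
    {"id": "ENSG02", "name": "GENE2", "chr": 2},  # chr has the wrong type
]

nodes = merge_nodes(source_a, source_b)
report = {n["id"]: check_against_schema(n, GENE_SCHEMA) for n in nodes}
print(report)
```

<p>Running the check after the merge, as here, means a property supplied by only one source is not reported as missing.</p>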
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-pipeline">The pipeline<a href="https://dmer-group.example.com/news/neo4j_data_integration#the-pipeline" class="hash-link" aria-label="Direct link to The pipeline" title="Direct link to The pipeline" translate="no">​</a></h2>
<p>The code and documentation for the pipeline are here – <a href="https://github.com/elswob/neo4j-build-pipeline" target="_blank" rel="noopener noreferrer" class="">https://github.com/elswob/neo4j-build-pipeline</a></p>
<p>Below is a figure representing how this might fit into a production environment, with the pipeline running on a development server and shared data on a storage server</p>
<figure><img src="https://dmer-group.example.com/img/data-integration-overview-figure-1536x864.jpg" alt="Figure"><figcaption aria-hidden="true">Figure</figcaption></figure>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="setup">Setup<a href="https://dmer-group.example.com/news/neo4j_data_integration#setup" class="hash-link" aria-label="Direct link to Setup" title="Direct link to Setup" translate="no">​</a></h2>
<p>The project comes with a set of <a href="https://github.com/elswob/neo4j-build-pipeline/tree/main/test" target="_blank" rel="noopener noreferrer" class="">test data</a> that can quickly be used to demonstrate the pipeline and create a basic graph. This requires only a few steps, e.g.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">## clone the repo (use https if necessary)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">git clone git@github.com:elswob/neo4j-build-pipeline.git</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">cd neo4j-build-pipeline</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"> </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">## create the conda environment</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">conda env create -f environment.yml</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">conda activate neo4j_build</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"> </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">## create a basic environment variable file for test data</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">## works ok for this test, but needs modifying for real use</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">cp example.env .env</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"> </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">## run the pipeline</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">snakemake -r all --cores 4</span><br></span></code></pre></div></div>
<p>For a new project, the steps to create a graph from scratch are detailed here and proceed as follows:</p>
<ol>
<li class="">Create a set of source data.<!-- -->
<ul>
<li class="">These can be local to the graph or on an external server</li>
<li class="">Scripts that created them should be added to the code base</li>
</ul>
</li>
<li class="">Set up a local instance of the pipeline</li>
<li class="">Create a graph schema</li>
<li class="">Create processing scripts to read in raw data and modify to match schema</li>
<li class="">Test the build steps of individual or all data files and visualise data summary</li>
<li class="">Run the pipeline<!-- -->
<ol>
<li class="">Raw data are checked against schema and processed to produce clean node and relationship CSV and header files</li>
<li class="">Overlapping node data are merged</li>
<li class="">Neo4j graph is created using neo4j-admin import</li>
<li class="">Constraints and indices are added</li>
<li class="">Clean data are copied back to specified location</li>
</ol>
</li>
</ol>
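<p>To make the "clean node CSV and header files" step more concrete, the sketch below writes a header and a data file in the layout that neo4j-admin import reads, where the header names each column, marks the ID column with <code>:ID(...)</code> and ends with <code>:LABEL</code>. The schema dictionary, the function names and the example rows are hypothetical and are not taken from the pipeline itself.</p>

```python
import csv
import io

# Hypothetical schema for one node type; not taken from the pipeline itself.
schema = {
    "label": "Gene",
    "id_property": "id",
    "properties": {"id": "string", "name": "string", "start": "int"},
}


def header_fields(schema):
    """Build the header columns for a node file from the schema."""
    fields = []
    for prop, dtype in schema["properties"].items():
        if prop == schema["id_property"]:
            # The ID column is tagged with an ID space named after the label.
            fields.append(f"{prop}:ID({schema['label']})")
        else:
            fields.append(f"{prop}:{dtype}")
    fields.append(":LABEL")
    return fields


def write_node_files(schema, rows):
    """Return (header_text, data_text) for one node type."""
    header = io.StringIO()
    csv.writer(header, lineterminator="\n").writerow(header_fields(schema))
    data = io.StringIO()
    writer = csv.writer(data, lineterminator="\n")
    for row in rows:
        values = [row[prop] for prop in schema["properties"]]
        writer.writerow(values + [schema["label"]])
    return header.getvalue(), data.getvalue()


# Hypothetical cleaned rows.
rows = [
    {"id": "ENSG01", "name": "GENE1", "start": 1000},
    {"id": "ENSG02", "name": "GENE2", "start": 2500},
]
header_text, data_text = write_node_files(schema, rows)
print(header_text, data_text, sep="")
```

<p>Keeping the header in its own file means the data file contains only rows, so large data files can be regenerated or split without touching the column definitions.</p>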
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="future-plans">Future plans<a href="https://dmer-group.example.com/news/neo4j_data_integration#future-plans" class="hash-link" aria-label="Direct link to Future plans" title="Direct link to Future plans" translate="no">​</a></h2>
<p>We think the work we have done here may be of interest to others. If anyone would like to get involved in this project we would love to collaborate and work together towards refining and publishing the method. Comments also welcome.</p>
<ul>
<li class=""><strong>Code:</strong> <a href="https://github.com/elswob/neo4j-build-pipeline" target="_blank" rel="noopener noreferrer" class="">https://github.com/elswob/neo4j-build-pipeline</a></li>
<li class=""><strong>Email:</strong> <a href="mailto:ben.elsworth@bristol.ac.uk" target="_blank" rel="noopener noreferrer" class="">ben.elsworth@bristol.ac.uk</a></li>
<li class=""><strong>Twitter:</strong> <a href="https://twitter.com/elswob" target="_blank" rel="noopener noreferrer" class="">@elswob</a></li>
</ul>]]></content>
        <author>
            <name>Ben Elsworth</name>
        </author>
        <category label="database" term="database"/>
        <category label="Neo4J" term="Neo4J"/>
        <category label="data integration" term="data integration"/>
        <category label="software" term="software"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Reducing drug development costs]]></title>
        <id>https://dmer-group.example.com/news/drug-dev-video</id>
        <link href="https://dmer-group.example.com/news/drug-dev-video"/>
        <updated>2020-11-08T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Explaining our work in a way that is accessible to a wide audience is often challenging. Here we summarise some of our approaches to drug target prioritization in a short animation.]]></summary>
        <content type="html"><![CDATA[<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="overview">Overview<a href="https://dmer-group.example.com/news/drug-dev-video#overview" class="hash-link" aria-label="Direct link to Overview" title="Direct link to Overview" translate="no">​</a></h2>
<p>This <a href="https://www.youtube.com/embed/t77LZZlF4iw" target="_blank" rel="noopener noreferrer" class="">short animation</a> explains how we use Mendelian randomization and colocalization to help prioritise drug targets. One of our aims in both <a href="http://www.biocompute.org.uk/" target="_blank" rel="noopener noreferrer" class="">programme 4 of the MRC IEU</a> and the <a href="https://www.bristol.ac.uk/integrative-epidemiology/programmes/icep/" target="_blank" rel="noopener noreferrer" class="">Integrative Cancer Epidemiology Programme</a> is to integrate such prioritizations with other data to help inform drug development.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="video">Video<a href="https://dmer-group.example.com/news/drug-dev-video#video" class="hash-link" aria-label="Direct link to Video" title="Direct link to Video" translate="no">​</a></h2>
<div><iframe src="https://www.youtube.com/embed/t77LZZlF4iw" width="320" height="240" frameborder="0" allowfullscreen="allowfullscreen"></iframe></div>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="about-the-animation">About the animation<a href="https://dmer-group.example.com/news/drug-dev-video#about-the-animation" class="hash-link" aria-label="Direct link to About the animation" title="Direct link to About the animation" translate="no">​</a></h4>
<p>The animation is based on recent work by <a href="https://research-information.bris.ac.uk/en/persons/jie-zheng" target="_blank" rel="noopener noreferrer" class="">Dr Jie (Chris) Zheng</a>, Vice-Chancellor’s Fellow based in <a href="http://www.biocompute.org.uk/" target="_blank" rel="noopener noreferrer" class="">programme 4 of the MRC IEU</a>, who recently published an innovative Mendelian randomization and colocalization study of plasma protein levels in <a href="https://www.nature.com/articles/s41588-020-0682-6" target="_blank" rel="noopener noreferrer" class="">Nature Genetics</a> that demonstrated how genetic data can be used to support drug target prioritisation by identifying the causal effects of proteins on diseases.</p>
<p>Using a set of genetic epidemiology approaches, including Mendelian randomization and genetic colocalization, we built a causal network of 1002 plasma proteins on 225 human diseases. In doing so, we identified 111 putatively causal effects of 65 proteins on 52 diseases, covering a wide range of disease areas. The results of this study are accessible via <a href="http://www.epigraphdb.org/pqtl/" target="_blank" rel="noopener noreferrer" class="">EpiGraphDB</a>.</p>]]></content>
        <author>
            <name>Tom Gaunt</name>
        </author>
        <category label="drug targets" term="drug targets"/>
        <category label="video" term="video"/>
        <category label="MR" term="MR"/>
        <category label="colocalization" term="colocalization"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Visualising Brexit’s Impact on Food Safety in Britain]]></title>
        <id>https://dmer-group.example.com/news/food-safety</id>
        <link href="https://dmer-group.example.com/news/food-safety"/>
        <updated>2020-10-06T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[PhD students Marina Vabistsevits and Ollie Lloyd entered the Jean Golding Institute data visualization competition on food hazards from around the world. Here they present their visualizations and interpretation, which won them a runner-up prize.]]></summary>
        <content type="html"><![CDATA[<blockquote>
<p>Written by Marina Vabistsevits and Oliver Lloyd, researchers on PhD studentships linked to the “Data Mining Epidemiological Relationships” programme at the <a href="https://www.bristol.ac.uk/integrative-epidemiology/" target="_blank" rel="noopener noreferrer" class="">MRC IEU</a>. Follow us on twitter – <a href="https://twitter.com/marina_vab" target="_blank" rel="noopener noreferrer" class="">@marina_vab</a>, <a href="https://twitter.com/PlotThiggins" target="_blank" rel="noopener noreferrer" class="">@PlotThiggins</a></p>
</blockquote>
<p>Leaving the EU presents many unique challenges to Britain, among which is the crucial task of maintaining our high levels of food safety. As a submission to the Jean Golding Institute’s <a href="http://www.bristol.ac.uk/golding/get-involved/competitions/food-hazards-from-around-the-world-data-competition/" target="_blank" rel="noopener noreferrer" class="">data visualisation competition</a>, we briefly investigated the impacts that Brexit may have on British food supplies. The dataset used in this analysis was made available by the Food Standards Agency (FSA) as the focus of the competition, and all code used is freely available in <a href="https://github.com/mvab/JGI-food-hazards-viz-challenge" target="_blank" rel="noopener noreferrer" class="">our github repository</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-need-for-information-recompense">The Need for Information Recompense<a href="https://dmer-group.example.com/news/food-safety#the-need-for-information-recompense" class="hash-link" aria-label="Direct link to The Need for Information Recompense" title="Direct link to The Need for Information Recompense" translate="no">​</a></h2>
<p>In the first part of the analysis, we explored cases where food imported to Britain led to an alert being raised. The two biggest sources for such alerts were Britain’s internal alert systems (largely the FSA), and the EU’s Rapid Alert System for Food and Feed (RASFF).</p>
<p>Since Britain is on course to <a href="https://www.foodsafetynews.com/2019/03/food-safety-issues-up-in-the-air-as-u-k-approaches-brexit-u-s-food-targeted/" target="_blank" rel="noopener noreferrer" class="">lose access to RASFF-supplied information</a> once Brexit is finalised in early 2021, we created the visualisation below as a comparison of the FSA and the RASFF in terms of both the number of alerts raised and the corresponding food’s origin country for each alert.</p>
<p><img decoding="async" loading="lazy" alt="Map of the world where lines between the UK and other countries indicate the countries where alerts from the Rapid Alert System for Food and Feed have originated from" src="https://dmer-group.example.com/assets/images/EU-alert-system-1536x864-995fe0fb35197527cab4eb367d7391b3.jpeg" width="1536" height="864" class="img_ev3q"> <em>Figure: Alerts from the EU Alert System</em></p>
<p>The arcs show the countries of origin of imports that raised alerts, and the yellow-red density map shows the recorded hazard alert frequency from those origins. Interactive versions of the two map instances can be found by following these links: <a href="https://mvab.github.io/JGI-food-hazards-viz-challenge/content/kepler_notifications_about_UK_all.html" target="_blank" rel="noopener noreferrer" class="">RASFF</a>, <a href="https://mvab.github.io/JGI-food-hazards-viz-challenge/content/kepler_notifications_about_UK_by_UK.html" target="_blank" rel="noopener noreferrer" class="">UK internal alerts</a>.</p>
<p><img decoding="async" loading="lazy" alt="Map of the world where lines between the UK and other countries indicate the 8 countries where alerts from the UK Internal alerts have originated from" src="https://dmer-group.example.com/assets/images/RAFSS_vs_UK_alerts_UK-1536x864-0f200ce44d8d96303ac567b3c7bfb1c0.jpg" width="1536" height="864" class="img_ev3q"> <em>Figure: Alerts from the UK Alert System</em></p>
<p>If the UK does indeed lose access to the RASFF, the loss of food hazards information about our own imports will be tremendous. The burden then falls on the FSA to develop and extend their alert system (which currently focuses very little on internationally supplied food) to bridge this information gap and ensure food safety for globally imported goods. As of the time of writing we are unsure what steps are being taken by the FSA, or the government at large, to address this issue.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="post-brexit-shifts-in-food-hazard-threats">Post-Brexit Shifts in Food Hazard Threats<a href="https://dmer-group.example.com/news/food-safety#post-brexit-shifts-in-food-hazard-threats" class="hash-link" aria-label="Direct link to Post-Brexit Shifts in Food Hazard Threats" title="Direct link to Post-Brexit Shifts in Food Hazard Threats" translate="no">​</a></h2>
<p>As an extension of this work, we turned our attention to tariffs and the effect they might have on whom Britain chooses to import from. Upon leaving the EU the UK will have to negotiate new trade deals with both EU and non-EU countries. <a href="https://publications.parliament.uk/pa/ld201719/ldselect/ldeucom/129/12905.htm" target="_blank" rel="noopener noreferrer" class="">Since the cost for EU-produced food is expected to rise for Britain after Brexit</a>, we may indeed see Britain importing more from outside of the union, which would naturally bring a shift to the make-up of food hazards that our alert systems will need to detect. Anticipating this shift will allow us to better mitigate the accompanying risk if it does begin to materialise.</p>
<p>To this end, we explored the differences in food hazard threats posed by EU vs non-EU suppliers of Britain’s largest class of imported food: fruits and vegetables. The plot below shows the relative change in frequency for each category of food hazard in the case that Britain switched from 100% EU imports of fruit and vegetables to 100% non-EU. The hazard categories that are likely to increase in non-EU imports are highlighted in red. Please note that this is the most extreme case possible and is unlikely to unfold to this extent in reality; this plot is therefore presented as a guide to the different food threats posed by EU vs non-EU imports.</p>
<p><img decoding="async" loading="lazy" alt="Bar chart showing difference in frequency of various food hazards, such as foreign bodies and allergens, after switching to non EU imports" src="https://dmer-group.example.com/assets/images/all_hazards_eu_noneu-1536x1229-62cd5eadc1d3ff9ce801b942199c391b.png" width="1536" height="1229" class="img_ev3q"> <em>Figure: Hazard alerts for fruits and vegetables: EU vs non-EU imports</em></p>
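<p>One way to compute a relative change of this kind is sketched below in Python. This is an assumption about the calculation, not the code from our repository: each hazard category's share of alerts is computed separately for EU and non-EU imports, and the change is expressed relative to the EU share. The counts are made up for illustration and are not taken from the FSA dataset.</p>

```python
def category_shares(counts):
    """Convert alert counts per hazard category into shares of the total."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def relative_change(eu_counts, non_eu_counts):
    """Relative change in each category's share when moving from EU to non-EU imports.

    Assumes both inputs contain the same categories and every EU count is non-zero.
    """
    eu = category_shares(eu_counts)
    non_eu = category_shares(non_eu_counts)
    return {category: (non_eu[category] - eu[category]) / eu[category] for category in eu}


# Made-up counts for illustration only; not taken from the FSA dataset.
eu_alerts = {"allergens": 20, "foreign bodies": 30, "pesticide residues": 50}
non_eu_alerts = {"allergens": 10, "foreign bodies": 15, "pesticide residues": 75}

change = relative_change(eu_alerts, non_eu_alerts)

# Categories with a positive change would be the ones highlighted in red.
increasing = sorted(c for c, value in change.items() if value > 0)
print(change, increasing)
```

<p>Working with shares rather than raw counts keeps the comparison meaningful when the two groups of imports have very different numbers of alerts.</p>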
<p>Our full submission ‘Too Much Tooty in the Fruity: Keeping Food Safe in a Post-Brexit Britain’ can be found <a href="https://mvab.github.io/JGI-food-hazards-viz-challenge/" target="_blank" rel="noopener noreferrer" class="">here</a>, and includes a further breakdown of some of the categories of hazards displayed in the chart above. This work was awarded one of two <a href="https://jeangoldinginstitute.blogs.bristol.ac.uk/2020/09/16/food-hazards-from-around-the-world-data-competition/" target="_blank" rel="noopener noreferrer" class="">joint runner-up prizes of the competition</a>, tied with Angharad Stell’s Shiny app: <a href="https://a-stell.shinyapps.io/food_hazards_app/" target="_blank" rel="noopener noreferrer" class="">‘From a data space to knowledge discovery’</a>. The winner of the competition was Robert Eyre, who produced <a href="https://roberteyre.github.io/FSAComp/" target="_blank" rel="noopener noreferrer" class="">this impressive visualization dashboard</a> using D3. The Jean Golding Institute are hosting a showcase event on the 18th November, where all competition entries will be presented.</p>
<p><em>We would like to thank the <a href="http://www.bristol.ac.uk/golding/" target="_blank" rel="noopener noreferrer" class="">JGI</a> for hosting the <a href="http://www.bristol.ac.uk/golding/get-involved/competitions/food-hazards-from-around-the-world-data-competition/" target="_blank" rel="noopener noreferrer" class="">competition</a>, and our PhD supervisors, Prof.&nbsp;Tom Gaunt and Dr.&nbsp;Ben Elsworth, for encouraging us to enter.</em></p>]]></content>
        <author>
            <name>Marina Vabistsevits and Ollie Lloyd</name>
        </author>
        <category label="data visualization" term="data visualization"/>
        <category label="data science" term="data science"/>
    </entry>
</feed>