The overall workflow is depicted in Fig. 1. To analyze the association between NPIs and AD, we first preprocessed and integrated biomedical triples extracted from SemMedDB and SuppKG^{33}. We then employed several graph representation models to derive embeddings of ADInt. Finally, we selected the best-performing model to generate hypotheses regarding NPIs for AD and further evaluated them through discovery patterns and RWD analysis.
Materials
SemMedDB^{30} is a repository of semantic triples extracted from PubMed abstracts and titles using the SemRep program^{34}. We obtained triples from the PREDICATION table of SemMedDB, and the source sentences and text of the triples from the SENTENCE and PREDICATION_AUX tables. This allowed us to complement SuppKG with a broader range of knowledge related to interventions for AD beyond the dietary supplement domain. SemMedDB contains general medical knowledge as well as knowledge related to ADRD.
Our prior study^{33} found that the current Unified Medical Language System (UMLS)^{35} does not have sufficient coverage of DSs, which are an important category of NPIs. This also limits the representation of dietary supplements in SemMedDB. Thus, we developed SuppKG^{33}, a KG that focuses on DS. SuppKG comprises 56,635 nodes and 595,222 directed edges, including 2928 DS-specific nodes and 164,738 edges. The nodes in SuppKG are identified by the Concept Unique Identifiers (CUIs) in UMLS, while the predicates in the UMLS Semantic Network label the edges. To easily distinguish the DS-specific nodes, the letter “D” was added before the CUI representing a DS concept. For example, “DC0633482” indicates that “myrtol” (CUI: C0633482) is a DS concept.
SuppKG contains knowledge and triples about DS from iDISK^{36} and its extensions, which may not exist in the SemMedDB database. Thus, we integrated SuppKG with SemMedDB to obtain comprehensive coverage of DS and to link DS to other general medical knowledge.
To validate hypotheses arising from ADInt, we used Electronic Health Record (EHR) data obtained from the University of Minnesota (UMN) Clinical Data Repository. Ethical approval for this study was obtained from the UMN Institutional Review Board, and informed consent was obtained from all subjects and their legal guardians. The cohort under investigation comprised 10,844 individuals who had been diagnosed with mild cognitive impairment (MCI), among whom 978 subsequently received diagnoses of ADRD during the period from 2001 to 2018. Individuals with MCI and ADRD were identified via the International Classification of Diseases (ICD) codes 331.83, 294.9, G31.84, and F09 (for MCI), and 290.40, 290.41, 331.0, 331.11, 331.19, 331.82, G30.0, G30.1, G30.8, G30.9, G31.01, G31.09, G31.83, F01.50, and F01.51 (for ADRD). For the MCI cohort, individuals were required to have at least one documented diagnosis of MCI and no ADRD diagnoses. The ADRD cohort comprised individuals meeting the following criteria: (1) receipt of an ADRD diagnosis, (2) documentation of an MCI diagnosis preceding the ADRD diagnosis, and (3) a minimum interval of six months between the initial MCI diagnosis and the subsequent ADRD diagnosis.
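The cohort criteria above can be illustrated with a minimal sketch. The function name `assign_cohort`, the input layout (a list of code and date pairs per patient), and the approximation of "six months" as 183 days are assumptions made for illustration; they are not taken from the study's actual pipeline.

```python
from datetime import date

# ICD codes as listed in the text.
MCI_CODES = {"331.83", "294.9", "G31.84", "F09"}
ADRD_CODES = {
    "290.40", "290.41", "331.0", "331.11", "331.19", "331.82",
    "G30.0", "G30.1", "G30.8", "G30.9", "G31.01", "G31.09",
    "G31.83", "F01.50", "F01.51",
}
MIN_GAP_DAYS = 183  # assumption: "six months" approximated as 183 days


def assign_cohort(diagnoses):
    """Classify one patient from a list of (icd_code, date) pairs.

    Returns "MCI", "ADRD", or None when neither definition is met.
    """
    mci_dates = [d for code, d in diagnoses if code in MCI_CODES]
    adrd_dates = [d for code, d in diagnoses if code in ADRD_CODES]
    if not mci_dates:
        return None  # both cohorts require an MCI diagnosis
    if not adrd_dates:
        return "MCI"  # MCI diagnosis and no ADRD diagnosis
    first_mci, first_adrd = min(mci_dates), min(adrd_dates)
    # ADRD cohort: MCI precedes ADRD by at least the minimum interval.
    if first_mci < first_adrd and (first_adrd - first_mci).days >= MIN_GAP_DAYS:
        return "ADRD"
    return None
```

For example, a patient with a G31.84 record followed a year later by a G30.9 record would be assigned to the ADRD cohort, whereas a gap of two months would exclude the patient from both cohorts under these assumptions.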
Preprocessing and integration
To enhance the representation of nodes and relations in the KG, we performed preprocessing before integrating SuppKG and SemMedDB in order to filter out generic, uninformative, and incorrect triples. The preprocessing consisted of three steps^{31}:

(1) Filtering triples by rules. First, we removed nodes in the graph that represented generic concepts by referencing the GENERIC_CONCEPT table provided by the SemMedDB database. This table contains concepts such as “Disease” and “Cells,” which are known to be too broad to be useful for knowledge discovery. Additionally, concepts belonging to semantic groups that were unlikely to be useful for predicting interventions for AD, such as “Activities & Behaviors” and “Concepts & Ideas,” were eliminated. Finally, only relations deemed relevant for LP were kept, including AFFECTS, CAUSES, COEXISTS_WITH, PREVENTS, TREATS, etc.

(2) Removing high-degree concepts and uninformative semantic relations. High-degree concepts in the KG may be too general to be useful for knowledge discovery due to their broad associations with many other concepts. To address this issue, we first computed the in-degree \(k_{i}^{in}\) and out-degree \(k_{i}^{out}\) of each node in the KG. Next, we calculated a log-likelihood measure known as \(G^{2}\)^{37} for each triple, which quantifies the strength of the relationship between the nodes in the triple. The \(G^{2}\) formula is given by:
$$G^{2} = 2\sum\limits_{i,j,k} n_{ijk} \times \log\left( \frac{n_{ijk}}{m_{ijk}} \right), \quad m_{ijk} = \frac{\sum\nolimits_{i} n_{jk} \times \sum\nolimits_{j} n_{ik} \times \sum\nolimits_{k} n_{ij}}{T^{2}}$$
where \(n_{ijk}\) is the entry (i, j, k) of the observation table (OT) containing the observed frequencies of a triple, \(m_{ijk}\) is the entry (i, j, k) of the expectation table (ET) describing the expected values assuming independence of the terms in triples, and \(T = \sum n_{ijk}\). Finally, we normalized \(k_{i}^{in}\), \(k_{i}^{out}\), and \(G^{2}\) and summed them to obtain a final score for each triple. A higher score indicates that the triple is less specific and less informative. Consequently, we filtered out the higher-scoring triples to reduce the size of the KG to approximately 1.8 M triples, which can be processed by our GPU in a reasonable amount of time.

(3) Further removing incorrect triples with a trained PubMedBERT model. The triples extracted from the SemMedDB database through SemRep may contain false positives, as the semantics expressed by the triples may differ from or contradict the content of their source sentences. To address this issue, we applied a PubMedBERT binary classification model that was fine-tuned in our earlier work to evaluate the correctness of the triples by referencing their source sentences^{31}. The F1 score of this model was 0.854, with a recall of 0.895 and a precision of 0.816.
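As an illustration of the log-likelihood measure used in step (2), the following minimal sketch computes \(G^{2}\) for a three-way observation table. It assumes that the sums in the definition of \(m_{ijk}\) are the three one-way marginal totals of the table (the usual independence model), and that empty cells contribute zero. The function name `g_squared` and the nested-list layout are hypothetical, and how the study built the observation table for each triple is not specified here.

```python
import math


def g_squared(table):
    """G^2 for a 3-way table of observed counts, table[i][j][k] = n_ijk.

    Expected counts assume independence of the three dimensions:
    m_ijk = (n_i.. * n_.j. * n_..k) / T^2, with T the grand total.
    """
    I, J, K = len(table), len(table[0]), len(table[0][0])
    cells = [(i, j, k) for i in range(I) for j in range(J) for k in range(K)]
    total = sum(table[i][j][k] for i, j, k in cells)
    # One-way marginal totals for each dimension.
    m_i = [sum(table[i][j][k] for j in range(J) for k in range(K)) for i in range(I)]
    m_j = [sum(table[i][j][k] for i in range(I) for k in range(K)) for j in range(J)]
    m_k = [sum(table[i][j][k] for i in range(I) for j in range(J)) for k in range(K)]
    g2 = 0.0
    for i, j, k in cells:
        n = table[i][j][k]
        if n == 0:
            continue  # convention: 0 * log(0) = 0
        expected = m_i[i] * m_j[j] * m_k[k] / total ** 2
        g2 += n * math.log(n / expected)
    return 2.0 * g2
```

A table whose counts factorize exactly (for instance, all cells equal) gives \(G^{2} = 0\), while counts concentrated in a few cells give a large positive value.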
After preprocessing, we integrated the resulting triples from both sources. For DS concept nodes in SemMedDB triples, we added the letter D before their CUIs to match the identifiers in SuppKG. Because the subject and object entities of the integrated triples are identified by UMLS CUIs and their predicates come from the UMLS Semantic Network, we added to SuppKG the new triples that did not overlap with its existing triples, without mapping concepts or integrating ontologies. We named the resulting integrated KG ADInt.
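The integration step can be sketched as a set union after re-labelling DS concepts. This is a minimal illustration, not the study's code: the function names, the in-memory tuple representation, and the CUIs `C0000001` and `C0000002` used in the usage note are placeholders, and the example triples are toy data rather than findings.

```python
def mark_ds(cui, ds_cuis):
    """Prefix a CUI with "D" when it denotes a dietary supplement concept."""
    return "D" + cui if cui in ds_cuis else cui


def integrate(suppkg_triples, semmed_triples, ds_cuis):
    """Merge SemMedDB triples into SuppKG, keeping only non-overlapping ones.

    Triples are (head_cui, predicate, tail_cui) tuples; ds_cuis holds the
    unprefixed CUIs of DS concepts. Using a set drops duplicates.
    """
    merged = set(suppkg_triples)
    for head, predicate, tail in semmed_triples:
        merged.add((mark_ds(head, ds_cuis), predicate, mark_ds(tail, ds_cuis)))
    return merged
```

With `ds_cuis = {"C0633482"}`, a SemMedDB triple headed by `C0633482` is rewritten to `DC0633482` before the union, so it collapses onto any identical SuppKG triple instead of creating a second node for the same supplement.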
NPI nodes identification
We trained and evaluated different approaches to identify nodes representing DS and CIH concepts in ADInt. In SuppKG, DS concept nodes are denoted by a special mark, the letter D added before their CUI. This mark was retained during the integration of SuppKG and SemMedDB triples, allowing us to easily identify these nodes as DS concepts. Unlike DS nodes, nodes describing CIH concepts cannot be identified directly from the KG. To overcome this limitation, we developed a list of CIH concepts, known as the CIH concepts list or CIHLex^{38}.
Link prediction model training and evaluation
A KG can be represented as a labeled directed multigraph \(KG = \left( {E,R,G} \right)\), where E denotes the set of nodes representing entities, R denotes the set of edges representing relations, and \(G \subseteq E \times R \times E\) is a set of triples 〈h, r, t〉, where h represents the head entity, r the relation, and t the tail entity. Despite the vast amount of knowledge contained in KGs, they are often incomplete due to factors such as noise, missing data, and sparsity. Link prediction (LP) methods therefore seek to infer new triples that are not explicitly represented in the KG but can be logically deduced from the existing ones. The objective of LP is to predict the most probable entity or relation that completes (h, r, ?) (tail prediction), (h, ?, t) (edge prediction), or (?, r, t) (head prediction). LP for KGs can be framed as a ranking task, which aims to learn a prediction function that assigns higher scores to true triples and lower scores to false triples. To perform LP on our KG, we explored four KG embedding models (TransE^{39}, RotatE^{40}, DistMult^{41} and ComplEx^{42}) and two graph convolutional network models (RGCN^{43} and CompGCN^{44}).
TransE^{39} is a simple and effective model for LP, particularly for modeling one-to-one relations. In TransE, a triple (h, r, t) is represented as a translation from the embedding of the head entity h to the embedding of the tail entity t, with the relation r acting as the translation vector in the embedding space. This formulation implies that if a triple (h, r, t) exists, the embedding of entity h plus the representation of relation r should be close to the embedding of entity t. The TransE score function measures the plausibility of a triple and is defined as follows
$$s\left( {h,r,t} \right) = \left\| \mathbf{h} + \mathbf{r} - \mathbf{t} \right\|$$
where \(\mathbf{h}, \mathbf{r}, \mathbf{t} \in {\mathbb{R}}^{d}\) are the embeddings of h, r, and t. Unlike TransE, the RotatE^{40} model treats each relation as a rotation from the head entity to the tail entity in a complex vector space, and the score function can be defined as
$$s\left( {h,r,t} \right) = \left\| \mathbf{h} \circ \mathbf{r} - \mathbf{t} \right\|$$
where \(\circ\) is the Hadamard product.
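To make the two distance-based score functions concrete, the sketch below evaluates them on plain Python lists. It is an illustration under stated assumptions rather than the DGL-KE implementation: the Euclidean norm is assumed for both models, the RotatE relation is parameterized by rotation angles so that each relation component has unit modulus, and the function names are hypothetical. Under both formulas, a smaller distance corresponds to a more plausible triple.

```python
import cmath
import math


def transe_score(h, r, t):
    """TransE distance ||h + r - t|| for real-valued embedding lists."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def rotate_score(h, phases, t):
    """RotatE distance ||h o r - t|| for complex-valued embedding lists.

    Each relation component is r_i = exp(i * phase_i), i.e. a rotation
    in the complex plane; "o" is the element-wise (Hadamard) product.
    """
    return math.sqrt(
        sum(abs(hi * cmath.exp(1j * ph) - ti) ** 2 for hi, ph, ti in zip(h, phases, t))
    )
```

For instance, `transe_score([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])` is zero because the relation vector translates the head exactly onto the tail, and rotating the complex number 1 by a quarter turn lands on the imaginary unit, so `rotate_score([1 + 0j], [math.pi / 2], [1j])` is zero up to floating-point error.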
DistMult^{41} is one of the most basic semantic matching models, and its scoring function can be defined as
$$s\left( {h,r,t} \right) = \mathbf{h}^{T} \mathbf{r}\mathbf{t}$$
The drawback of DistMult is that it only works for symmetric relations; that is, the scores of (h, r, t) and (t, r, h) calculated by DistMult are the same. This may cause problems in our KG: for example, the triple (Bariatric Surgery, TREATS, Alzheimer’s) and the triple (Alzheimer’s, TREATS, Bariatric Surgery) should receive different scores. To address this limitation, ComplEx^{42} was proposed as an extension of DistMult. ComplEx uses a complex vector space and is capable of modeling asymmetric relations. Specifically, the head and tail embeddings of the same entity are represented as complex conjugates, which allows (h, r, t) and (t, r, h) to be distinguished. This enables ComplEx to provide consistent scores for both symmetric and asymmetric relations. The scoring function of ComplEx can be defined as follows
$$s\left( {h,r,t} \right) = \operatorname{Re}\left( \mathbf{h}^{T} \mathbf{r}\mathbf{t} \right)$$
where \(\operatorname{Re}(\cdot)\) denotes the real part of a complex value.
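The symmetry issue can be checked numerically with a small sketch. It assumes the usual reading of these models, in which the relation acts as a diagonal matrix (so the score is a sum of element-wise triple products) and, for ComplEx, the tail embedding enters as a complex conjugate. The function names are hypothetical.

```python
def distmult_score(h, r, t):
    """DistMult: sum_i h_i * r_i * t_i, with the relation acting diagonally."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))


def complex_score(h, r, t):
    """ComplEx: Re(sum_i h_i * r_i * conj(t_i)) on complex embeddings."""
    return sum(hi * ri * ti.conjugate() for hi, ri, ti in zip(h, r, t)).real


# DistMult cannot distinguish (h, r, t) from (t, r, h).
h, r, t = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]
symmetric = distmult_score(h, r, t) == distmult_score(t, r, h)

# ComplEx can, as soon as the relation has an imaginary part.
hc, rc, tc = [1 + 2j], [0 + 1j], [3 - 1j]
forward, backward = complex_score(hc, rc, tc), complex_score(tc, rc, hc)
```

In this toy case the two ComplEx scores have opposite signs, whereas a purely real relation embedding would make them equal, which is how the model covers both symmetric and asymmetric relations.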
GCNs are a neural network approach for processing graph-structured data^{45}. However, most existing GCNs are designed for simple undirected graphs and cannot handle the multiple types of nodes and directed links that exist in our KG. To address this challenge, we explored graph convolutional neural network models that can handle heterogeneous graphs. Specifically, we evaluated two models: Relational Graph Convolutional Network (RGCN)^{43} and CompGCN^{44}. Building on the GCN architecture, RGCNs treat each relation type separately and fuse the resulting features when updating the hidden states of nodes^{43}. The propagation model for calculating the forward-pass update of a node in RGCNs can be defined as
$$\mathbf{x}_{i}^{\left( {l + 1} \right)} = \sigma \left( \sum\limits_{r \in {\mathcal{R}}} \sum\limits_{j \in {\mathcal{N}}_{i}^{r}} \frac{1}{c_{i,r}} \mathbf{W}_{r}^{\left( l \right)} \mathbf{x}_{j}^{\left( l \right)} + \mathbf{W}_{0}^{\left( l \right)} \mathbf{x}_{i}^{\left( l \right)} \right),$$
where \(\mathbf{x}_{i}^{\left( l \right)} \in {\mathbb{R}}^{d^{\left( l \right)}}\) is the hidden state of the ith node in the lth layer of the neural network; \({\mathcal{R}}\) is the set of relations and \({\mathcal{N}}_{i}^{r}\) denotes the neighbor set of the ith node under relation \(r \in {\mathcal{R}}\); \(\mathbf{W}_{r}^{\left( l \right)}\) and \(\mathbf{W}_{0}^{\left( l \right)}\) are the learnable weight matrix under relation \(r\) and the self-loop weight matrix in the lth layer, respectively; and \(c_{i,r}\) is a normalization constant that can either be learned or chosen in advance. Using RGCNs for LP tasks can be regarded as a process of encoding and decoding: an RGCN produces latent feature vectors of entities, and a tensor factorization model exploits these vectors to predict edges. Taking the DistMult decomposition as an example, the score of a triple (h, r, t) is calculated as^{43}
$$s\left( {h,r,t} \right) = \mathbf{h}^{T} \mathbf{r}\mathbf{t}$$
Thus, to make the model score observed triples higher than negative triples, the loss function can be defined as^{43}
$${\mathcal{L}} = - \frac{1}{\left( {1 + \omega } \right)\left| {\hat{\varepsilon }} \right|}\sum\limits_{\left( {h,r,t,y} \right) \in {\mathcal{T}}} y\log l\left( {s\left( {h,r,t} \right)} \right) + \left( {1 - y} \right)\log \left( {1 - l\left( {s\left( {h,r,t} \right)} \right)} \right)$$
where \({\mathcal{T}}\) is the set of all triples (including positive and negative triples); \(\omega\) is the number of negative triples; \(\left| {\hat{\varepsilon }} \right|\) is the number of edges; \(l(\cdot)\) is the logistic sigmoid function; and \(y\) is an indicator, where \(y = 1\) means the triple is positive and \(y = 0\) means it is negative.
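The loss above is a binary cross-entropy over scored triples, which the following minimal sketch evaluates directly. It is not the DGL training code: the function names and the list-of-pairs input are hypothetical, and \(\omega\) and the edge count are passed in explicitly as the normalization terms.

```python
import math


def sigmoid(x):
    """Logistic sigmoid l(x)."""
    return 1.0 / (1.0 + math.exp(-x))


def lp_loss(scored_triples, num_edges, omega):
    """Cross-entropy loss over (score, label) pairs.

    label is 1 for an observed triple and 0 for a negative sample;
    the sum is normalized by (1 + omega) * num_edges as in the formula.
    """
    total = 0.0
    for score, label in scored_triples:
        p = sigmoid(score)
        total += label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return -total / ((1 + omega) * num_edges)
```

With one positive and one negative triple both scored 0, the loss equals log 2, the value expected from an uninformative model; pushing the positive score up and the negative score down drives the loss towards zero.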
CompGCN^{44} is another extension of GCN for heterogeneous graphs, which systematically leverages entity-relation composition operations and jointly learns latent feature vector representations for both nodes and edges in the graph. Unlike RGCNs, CompGCN performs a composition operation \(\phi\) over each edge in the neighborhood of the central node using the embeddings of edges and nodes. The update equation for node embeddings in CompGCN can be defined as^{44}
$$\mathbf{x}_{i}^{\left( {l + 1} \right)} = f\left( \sum\limits_{\left( {j,k} \right) \in {\mathcal{N}}_{i}^{r}} \mathbf{W}_{\lambda \left( k \right)}^{\left( l \right)} \phi \left( \mathbf{x}_{j}^{\left( l \right)} ,\mathbf{y}_{k}^{\left( l \right)} \right) \right)$$
where \(\mathbf{x}_{j}^{\left( l \right)}\) and \(\mathbf{y}_{k}^{\left( l \right)}\) are the hidden states of the neighboring jth node and its kth relation, respectively, in the lth layer, and \(\mathbf{W}_{\lambda \left( k \right)}^{\left( l \right)}\) is a relation-type-specific parameter, which can be used for direction-specific weights. Depending on whether the edge is an original edge, an inverse edge, or a self-loop edge, \(\mathbf{W}_{\lambda \left( k \right)}^{\left( l \right)}\) corresponds to different weight matrices. \(\phi \left( \cdot \right)\) is used to combine two vectors of the same dimension, and can be Subtraction^{39}, Multiplication^{41}, or Circular-correlation^{46}. After updating the node embeddings, we can also update the relation embeddings as follows^{44}
$$\mathbf{y}_{k}^{\left( {l + 1} \right)} = \mathbf{W}_{rel}^{k} \mathbf{y}_{k}^{\left( l \right)} ,$$
where \(\mathbf{W}_{rel}^{k}\) is a weight matrix that projects all relations to the same embedding space as the nodes, which allows them to be used in the next layer. Similar to the RGCN LP model, we selected a tensor factorization model (ConvE) to calculate the scores of triples, and the same standard binary cross-entropy loss function was used to train the convolutional networks.
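The three candidate composition operators can be written in a few lines. The sketch below uses plain lists and a direct \(O(d^{2})\) sum for circular correlation, defined here as \((x \star y)_{k} = \sum_{i} x_{i}\, y_{(i+k) \bmod d}\); practical implementations typically use the FFT instead. The function names are hypothetical and this is not the DGL-based code used in the study.

```python
def compose_sub(x, y):
    """Subtraction composition: phi(x, y) = x - y."""
    return [a - b for a, b in zip(x, y)]


def compose_mult(x, y):
    """Multiplication composition: phi(x, y) = x * y (element-wise)."""
    return [a * b for a, b in zip(x, y)]


def compose_corr(x, y):
    """Circular correlation: phi(x, y)_k = sum_i x_i * y_((i + k) mod d)."""
    d = len(x)
    return [sum(x[i] * y[(i + k) % d] for i in range(d)) for k in range(d)]
```

All three operators map a node embedding and a relation embedding of dimension d to a single vector of dimension d, which is why they can be swapped without changing the rest of the layer.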
The hyperparameters for TransE, RotatE, DistMult, and ComplEx were tuned using a grid search on the validation set for each prediction model. We tuned the following parameters: learning rate {0.01, 0.001}, number of hidden dimensions {100, 200, 400}, regularization coefficient {1 × 10^{−6}, 1 × 10^{−9}}, and mini-batch size {250, 1000}. For RGCN and CompGCN, we tuned the learning rate {0.01, 0.001}, number of hidden dimensions {100, 200}, number of GCN layers {1, 2}, and mini-batch size {250, 500, 1000}. For RGCN, we applied a dropout layer with a rate of 0.2 to the GCN encoder to prevent overfitting and applied l2 regularization to the link prediction decoder with a penalty of 0.01. For CompGCN, regularization of the GCN encoder involved a feature dropout rate of 0.1 and a dropout rate of 0.3 after each layer, and the ConvE decoder employed dropout rates of 0.3 for hidden layer outputs and features. The composition operation employed was circular-correlation. For all models, negative samples were generated by randomly corrupting the heads or tails of positive triples at a 1:20 ratio during training.
All work was conducted using Python scripts. The TransE, RotatE, DistMult, and ComplEx models were implemented with the DGL-KE 0.1.0.dev0 package^{47}. Both the RGCN and CompGCN models were built using the torch 1.13.1^{48} and DGL 1.0.1^{49} packages. We describe training and evaluation details in the following tasks.
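The negative sampling scheme described above (20 corrupted triples per positive triple) can be sketched as follows. The function name, the equal probability of corrupting the head or the tail, the seeded default generator, and the `max_tries` guard are assumptions for illustration; the packages named above provide their own samplers. Passing the set of known triples as `known` gives the filtered variant, in which corrupted triples that already exist in the graph are discarded.

```python
import random


def corrupt(triple, entities, num_negatives=20, known=frozenset(),
            rng=None, max_tries=10_000):
    """Generate negatives by replacing the head or the tail of `triple`.

    entities: list of candidate entity identifiers.
    known:    triples that must not be returned (filtered corruption).
    """
    rng = rng or random.Random(0)  # seeded default keeps the sketch reproducible
    head, relation, tail = triple
    negatives = []
    for _ in range(max_tries):  # guard against graphs with too few candidates
        if len(negatives) == num_negatives:
            break
        entity = rng.choice(entities)
        # Corrupt the head or the tail with equal probability (assumption).
        candidate = (entity, relation, tail) if rng.random() < 0.5 else (head, relation, entity)
        if candidate != triple and candidate not in known:
            negatives.append(candidate)
    return negatives
```

Each returned triple keeps the original relation and differs from the positive triple in exactly one endpoint.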
Open LBD task
The open discovery approach is specifically aimed at generating novel hypotheses. Given a head node, the system produces related tail nodes, thereby facilitating the identification of previously unexplored triple relationships^{50}. To evaluate the effectiveness of our LP model, we used two evaluation methods.
The first is time slicing^{51}. This evaluation approach involves partitioning the KG at a specific time, training the model on the data prior to that time, and testing the model on the data after that time to determine whether the links formed after the partitioning time can be accurately predicted. Specifically, we ordered the triples chronologically and divided the KG into training, validation, and testing sets in an 8:1:1 ratio, where the publication date of the paper mentioning the triple was used as its time; the partitioning times were April 2020 and April 2021, respectively. To evaluate model performance, we computed ranking-based metrics for each model: mean rank (MR), mean reciprocal rank (MRR), and Hits@k (k = 1, 3, and 10)^{39}. Specifically, for each true triple in the testing set, we generated a batch of negative samples by randomly replacing the head or tail nodes while ensuring that these negative samples did not exist in our graph, i.e., we employed corruption with filtering. We then used the trained model to calculate the scores of the true triple and its negative samples, and obtained the ranks of the true triples to calculate MR, MRR, and Hits@k. MR represents the average rank assigned to the true triples in the test set:
$$MR = \frac{1}{\left| T \right|}\sum\limits_{t \in T} rank\left( t \right)$$
where T is the set of all triples in the test set, and rank(t) is the position of the triple t in the sorted list of t and its negative samples.
MRR is the average inverse rank of all true triples in the test set:
$$MRR = \frac{1}{\left| T \right|}\sum\limits_{t \in T} \frac{1}{rank\left( t \right)}$$
Hits@k is the proportion of test triples for which the true triple appears among the top k ranked triples:
$$Hits@k = \frac{1}{\left| T \right|}\sum\limits_{t \in T} I\left[ {rank\left( t \right) \le k} \right]$$
where I is an indicator function: \(I\left[ {rank\left( t \right) \le k} \right]\) is equal to 1 if t is ranked between 1 and k, and 0 otherwise.
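The three ranking metrics follow directly from the ranks of the true triples, as in the minimal sketch below. The function names are hypothetical, higher scores are assumed to be better, and ties are resolved optimistically (a negative sample only outranks the true triple if its score is strictly higher), which is one of several common conventions.

```python
def rank_of_true(true_score, negative_scores):
    """1-based rank of the true triple among itself and its negatives."""
    return 1 + sum(1 for s in negative_scores if s > true_score)


def ranking_metrics(ranks, ks=(1, 3, 10)):
    """Compute MR, MRR, and Hits@k from a list of 1-based ranks."""
    n = len(ranks)
    metrics = {
        "MR": sum(ranks) / n,
        "MRR": sum(1.0 / r for r in ranks) / n,
    }
    for k in ks:
        metrics[f"Hits@{k}"] = sum(1 for r in ranks if r <= k) / n
    return metrics
```

For four test triples ranked 1, 2, 4, and 20, this gives MR = 6.75, MRR = 0.45, Hits@1 = 0.25, Hits@3 = 0.5, and Hits@10 = 0.75, which shows how a single poorly ranked triple dominates MR while barely affecting MRR.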
In the second evaluation approach, we used clinical trial data from ClinicalTrials.gov as a benchmark for predicting potential interventions for AD. Our approach was based on the assumption that interventions under investigation for AD have the potential to be repurposed for other indications. Specifically, we obtained a list of interventions used in AD clinical trials registered after April 21, 2020, by searching for the term “Alzheimer” and restricting the results to interventional studies as of November 4, 2022. We excluded control interventions labeled as “placebo,” resulting in a total of 671 interventions. We processed these interventions using MetaMap with the UMLS 2022AA release to identify relevant UMLS concepts, resulting in 1606 concepts. The CUIs of these concepts were then used as head nodes, with “PREVENTS” as the relation and “C0002395” (the CUI of the AD concept) as the tail node, creating a series of triples for testing. Finally, we used these newly generated triples based on clinical trial data as another test set to evaluate each trained model.
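Constructing this second test set amounts to pairing each mapped CUI with a fixed relation and tail. The sketch below is illustrative only: the function name is hypothetical, duplicate CUIs are assumed to be collapsed, and the CUIs in the usage note are placeholders rather than actual MetaMap output.

```python
AD_CUI = "C0002395"  # CUI of the Alzheimer's disease concept, as given in the text


def build_trial_test_set(intervention_cuis, relation="PREVENTS", tail=AD_CUI):
    """Turn mapped intervention CUIs into (head, relation, tail) test triples.

    dict.fromkeys removes duplicate CUIs while preserving their order.
    """
    return [(cui, relation, tail) for cui in dict.fromkeys(intervention_cuis)]
```

For example, `build_trial_test_set(["C0000001", "C0000002", "C0000001"])` yields two triples, one per distinct placeholder CUI, each pointing to `C0002395` through `PREVENTS`.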
Closed LBD task
The closed discovery strategy seeks to identify the connections between given head and tail nodes in order to evaluate a specific hypothesis^{50}. Because the KG embedding and graph neural network models only provide node and edge representations, patterns from closed discovery were used to infer possible mechanisms for the repurposed interventions. To uncover potential logical connections between concepts in a network, we employed a closed discovery approach by combining sequences of relation types^{32}. For DS, the discovery patterns we focused on were:
InterventionA-INHIBITS|INTERACTS_WITH-ConceptB AND
ConceptB-AFFECTS|CAUSES|PREDISPOSES|ASSOCIATED-Alzheimer’s disease AND
NOT (InterventionA-PREVENTS-Alzheimer’s disease)
where InterventionA is a node whose type is DS; ConceptB can be any concept; | indicates logical OR; and for Alzheimer’s disease, we focus on the node with CUI C0002395. When investigating the repurposing potential of CIH interventions, we encountered a challenge because the UMLS semantic types of most CIHs are “topp” (Therapeutic or Preventive Procedure) or “dora” (Daily or Recreational Activity). As these types do not have INHIBITS or INTERACTS_WITH relationships to other concepts in the UMLS Semantic Network, and the number of possible paths is not extensive, we did not constrain the predicates in the patterns. The discovery patterns for CIH were:
InterventionB-(any predicate)-ConceptB AND
ConceptB-(any predicate)-Alzheimer’s disease AND
NOT (InterventionB-PREVENTS-Alzheimer’s disease)
the place InterventionB is a node whose sort is CIH. We visualized the community construction utilizing ChiPlot (https://www.chiplot.online/).
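The discovery patterns amount to a two-hop path search with a negated direct edge. The sketch below expresses both the DS and CIH variants over an in-memory list of triples; the function name, the tuple representation, and the node identifiers in the usage note are hypothetical, and the predicate sets simply mirror the names used in the patterns.

```python
AD = "C0002395"
DS_FIRST_HOP = {"INHIBITS", "INTERACTS_WITH"}
DS_SECOND_HOP = {"AFFECTS", "CAUSES", "PREDISPOSES", "ASSOCIATED"}


def discovery_paths(triples, interventions, first=None, second=None, target=AD):
    """Find intervention -> ConceptB -> target paths.

    first / second: allowed predicate sets for each hop; None means
    "any predicate" (the CIH patterns). Interventions that already have
    a PREVENTS edge to the target are excluded (the NOT clause).
    """
    triples = set(triples)
    already_known = {h for h, r, t in triples if r == "PREVENTS" and t == target}
    # Index ConceptB candidates by the predicates linking them to the target.
    into_target = {}
    for h, r, t in triples:
        if t == target and (second is None or r in second):
            into_target.setdefault(h, []).append(r)
    paths = []
    for h, r, t in sorted(triples):
        if h not in interventions or h in already_known:
            continue
        if t in into_target and (first is None or r in first):
            for r2 in sorted(into_target[t]):
                paths.append((h, r, t, r2, target))
    return paths
```

Calling the function with `first=DS_FIRST_HOP, second=DS_SECOND_HOP` reproduces the DS patterns, and leaving both at `None` reproduces the unconstrained CIH patterns.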
Evaluation through RWD analysis
To further support our results, we performed RWD analysis for our predicted non-pharmacological interventions. DS were identified from structured medication orders and unstructured clinical notes, and CIH were identified from structured Current Procedural Terminology (CPT) codes. Through power analysis (see Supplementary Fig. S1 online), we determined that achieving more than 80% statistical power requires a sample size of approximately 1000 individuals, with over 20% of them receiving treatment with either DS or CIH. Examination of the dataset revealed that only psychotherapy (42.9%) and manual therapy techniques (28.2%) met this criterion. Subsequently, each 60-day interval following MCI diagnosis was used as a time series, extending until the final visit recorded within a ten-year timeframe. Exposure groups for ADRD patients were established based on the use of CIHs (psychotherapy and manual therapy techniques) by MCI patients. Kaplan–Meier plots were used to visualize the unadjusted probability of ADRD in the exposed group. To assess the impact of CIHs on ADRD incidence, a multivariate-adjusted Cox regression model was applied. The initial model was adjusted for age and sex, while the second model included additional covariates such as delirium, mental retardation, aphasia, depression, anxiety, bipolar disorder, hypertension, hyperlipidemia, vitamin B12 deficiency, and cardiovascular diseases, all of which are known to be associated with ADRD. Additionally, a case–control dataset was constructed from the MCI patient cohort, with patients eventually diagnosed with ADRD serving as cases. Fisher’s exact test was then used to evaluate statistically significant differences between patients who used the predicted DS and those who did not.
All analyses were performed using Python 3.9 with the lifelines 0.27, scipy 1.10, and matplotlib 3.7 packages.
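For readers unfamiliar with the case–control comparison, the following standard-library sketch computes a two-sided Fisher's exact test for a 2 × 2 table. It is a didactic re-implementation, not the analysis code: in practice `scipy.stats.fisher_exact` provides the same test. The two-sided p-value is taken here as the sum of the probabilities of all tables no more likely than the observed one, and the small relative tolerance is an assumption to absorb floating-point noise.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Rows could be DS users / non-users and columns ADRD cases / controls.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    p_value = sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)
```

On the classic table [[3, 1], [1, 3]] this returns 34/70 ≈ 0.486, so such a small sample would give no evidence of a difference between the two groups.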
Ethics declarations
All methods were carried out in accordance with the relevant guidelines and regulations.