Matching Patient Cases to Clinical Trials with Machine Learning and Information Retrieval Models
Phase 3: Word Embeddings
January 10, 2022
Index
1. Introduction
2. Experimental setup
3. Implemented methods
4. Results discussion
5. Conclusion
1. Introduction
The vast majority of clinical trials fail to meet their patient recruitment goal. NIH has estimated that 80% of clinical trials fail to meet their patient recruitment timeline and, more critically, many (or most) fail to recruit
the minimum number of patients to power the study as originally anticipated. Efficient patient trial recruitment is thus one of the major barriers to medical research, both delaying trials and forcing others to terminate entirely.
In this project, we built a system to retrieve clinical trials from ClinicalTrials.gov, a required registry for clinical trials in the United States. The goal is to find clinical trials in which patients can be enrolled.
The project was divided into three phases, which allowed observing the evolution of the related technology throughout the years.
The first phase of the project covered text analysis and the matching of documents to queries based on selected relevant aspects. It also had a strong component on metrics for evaluating the quality of that matching. To do this, we used the provided code to read the data, build a Vector Space Model (VSM) from it, and compute the metrics associated with the trials. We also implemented our own Language Model, based on the frequencies of words in documents, and used it to score the documents against our set of test queries. Comparing both approaches with the metrics was essential to understand which one works best in each case.
The second
phase of the project was about learning how to better rank documents, using different sections of the clinical trial document as predictors of the clinical trial relevance for each patient case. To do so, we first had to use the algorithms
of the first phase to compute the predictors, scoring the documents accordingly. After that, it was necessary to combine those scores (both LMJM and VSM of different sections) and train a logistic regression to generate the best coefficients
to weight the different models for each section. Then we used a test set to calculate an estimate for the true quality of the prediction and calculated the metrics for the new LETOR model.
Finally, the theoretical study concluded with more recent text analysis techniques and algorithms. Word embeddings brought an improved way of representing words, based on techniques such as Word2Vec, which can establish associations between tokens based on semantics. BERT, in turn, establishes relationships between tokens based on the context in which they are found in a particular document.
Accordingly, the third phase of the project focused on using these improved algorithms to represent and visualize relationships between the tokens of a given document, as well as to program a model capable of classifying documents as relevant or non-relevant for a given query.
Figure 1. The second phase of the project: The LETOR model
2. Experimental setup
For this phase of the project, we reduced the corpus considerably, choosing only one query and two documents, only one of which was relevant to the query. For each document, we used only the “detailed_description” field, since it was enough for the analysis we proposed. We are therefore left with two corpora, each pairing the query with one document. Each begins with a special [CLS] token and contains two special [SEP] tokens, one separating the query from the document and another at the end of the corpus.
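The layout of each corpus can be illustrated with a minimal sketch. The helper name `build_input_tokens` and the example texts are hypothetical, and the whitespace split is only a stand-in for the WordPiece tokenizer that a BERT-style model would actually apply:

```python
def build_input_tokens(query_tokens, document_tokens):
    """Lay out one query-document pair as [CLS] query [SEP] document [SEP]."""
    return ["[CLS]"] + list(query_tokens) + ["[SEP]"] + list(document_tokens) + ["[SEP]"]


# Hypothetical texts, split on whitespace for illustration only.
query = "stress in young students".split()
document = "this trial studies stress management".split()

tokens = build_input_tokens(query, document)
print(tokens)
```

In practice, a BERT tokenizer given the query and the document as a text pair inserts these special tokens itself, so the sketch only makes the resulting structure explicit.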
To split the text into tokens and extract their embeddings, we used the BERT model (“Bidirectional Encoder Representations from Transformers”). We also needed to visualize the resulting data. For these reasons, the project was developed in Python, using existing libraries that are already optimized for these problems. Some of the most important are:
- transformers (pre-trained Transformer-type models);
- bertviz (attention visualization for Transformer-type models, namely BERT);
- sklearn (notably the PCA and TSNE algorithms, to reduce the dimensionality of the data);
- numpy;
- torch (machine learning framework, used here to work with tensors, that is, multi-dimensional arrays);
- matplotlib (to plot the data).
3. Implemented methods
To process the text and extract the initial and contextual embeddings, we used the Transformer BioBERT, a biomedical language representation model designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, and question answering. This model provides ways to split text into tokens and later train and extract their embeddings layer by layer. Thus, the initial embeddings can be obtained from layer zero and the final embeddings from the last layer (the 13th in this model, that is, layer 12 when counting from zero). Note that the initial layer contains the pre-trained, context-independent embeddings, whereas the last layer already takes into account the relationships between the tokens in the given corpus. For this work, we chose two limited sets of tokens (one for each corpus) in order to produce clearer visualizations. These tokens were chosen for their significant relationship with the corpus in question. Consequently, the extraction and training of embeddings was limited to the tokens in these sets.
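As a rough sketch of the selection step, the snippet below picks the layer 0 and last-layer vectors for a chosen token set. The `hidden_states` tuple is a random stand-in, and the token list, the selected tokens, and the hidden size of 8 are illustrative assumptions. With the transformers library, a BERT-style model can return such a tuple (one entry per layer, 13 in total) when called with `output_hidden_states=True`; whether the project obtained it exactly this way is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical token sequence and illustrative sizes.
tokens = ["[CLS]", "stress", "in", "young", "students", "[SEP]"]
num_states, hidden_size = 13, 8  # 13 states: layer 0 plus 12 Transformer layers

# Stand-in for the model output: one (num_tokens, hidden_size) array per layer.
hidden_states = tuple(
    rng.normal(size=(len(tokens), hidden_size)) for _ in range(num_states)
)

# Keep only the tokens chosen for visualization.
selected = ["stress", "young", "students"]
rows = [tokens.index(token) for token in selected]

initial_embeddings = hidden_states[0][rows]   # context-independent embeddings
final_embeddings = hidden_states[-1][rows]    # contextual embeddings (layer 12)
```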
To compare the embeddings, we relied on the attention and self-attention mechanisms. Attention looks at the context in which each token occurs in the text in order to modify its embedding, bringing tokens with similar contexts closer together. Self-attention lets the input tokens interact with one another, computing the attention of each token in relation to all the others. The main focus of this work is to use this model to produce several types of visualization, listed and explained below:
Layer embeddings visualization: This type of visualization is achieved by reducing the number of dimensions of the embeddings to only two, using the Principal Component Analysis (PCA) algorithm. The result is a 2D scatter plot with one point per token, which makes it possible to analyze the similarity of the tokens from the distances between their points (a smaller distance means greater similarity);
Layer embeddings similarity visualization: This visualization presents a matrix showing the similarity of each token to all the other tokens, using the cosine distance between the embedding vectors. Warmer colors (namely yellow) represent greater similarity, while cooler colors (namely purple) represent less similarity. We therefore expect a diagonal with colors close to yellow, since it represents the relationship of a token to itself (which does not vary significantly). This visualization was used to verify how similarities changed between the input and output of each layer;
Self-attention head visualization: This view presents an interactive graph in which specific layers and heads can be selected. For each of them, it shows the connections established between tokens by the self-attention mechanism. The graph is especially important because it shows which connections are established in each head, allowing us to infer what kind of relationship or learning took place at that stage.
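The 2D projection behind the scatter plots can be sketched as follows. The project names sklearn's PCA; this sketch instead computes the same kind of projection with a singular value decomposition in numpy, so the coordinates may differ from sklearn's by a sign flip per axis. The function name `pca_2d` and the random input are assumptions for illustration.

```python
import numpy as np


def pca_2d(embeddings):
    """Project (num_tokens, hidden_size) embeddings onto their first two principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 16))  # stand-in for 6 token embeddings
points = pca_2d(embeddings)            # one (x, y) point per token
```

Each row of `points` can then be drawn with a scatter plot and labeled with its token.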
4. Results discussion
Layer embeddings visualization: For this visualization, we present two plots, one with the tokens of the document relevant to the chosen query and one with the tokens of the non-relevant document:
Figure 2. Scatter plot for the relevant document
Figure 3. Scatter plot for the non-relevant document
These plots show the points representing the selected tokens. The layer 0 points (initial embeddings) are in red and the last-layer points (final embeddings) are in blue. Initial embeddings reflect only the pre-trained word semantics, so points with similar meanings are closer to each other. Across the layers, the embeddings adapted, bringing closer together the tokens that appear in the same context in the corpus. This suggests that the model adapts not only to semantics, but also to the proximity of words in the text. The two plots do not differ much, even though the documents differ in relevance to the query. This happens because the model adapts to the context regardless of relevance.
Layer embeddings similarity visualization: In this case, we show all similarity matrices between consecutive layers. Each one of them presents the comparison between the input and output embeddings of each layer:
Figure 4. Similarity matrix for relevant document
Figure 5. Similarity matrix for non-relevant document
The plots show how the embeddings change between layers, with warmer colors indicating fewer changes. From this we can deduce that, for the relevant document, there is greater convergence, as there are fewer and fewer changes between consecutive layers. For the non-relevant document, this happens on a smaller scale, with less convergence.
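A matrix of this kind could be computed as in the sketch below, which compares the embeddings entering a layer with those leaving it. It assumes the plotted similarity is the cosine similarity (1 minus the cosine distance); the function name and the random stand-in data are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


rng = np.random.default_rng(1)
layer_input = rng.normal(size=(5, 16))                        # stand-in embeddings
layer_output = layer_input + 0.1 * rng.normal(size=(5, 16))   # small change in the layer

# Entry (i, j) compares token i before the layer with token j after it.
similarity = cosine_similarity_matrix(layer_input, layer_output)
```

The diagonal holds the similarity of each token to itself across the layer, which is why values close to 1 there indicate that the layer changed little.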
Figure 6. Similarity matrix between layers 0 and 12 for the relevant document
Figure 7. Similarity matrix between layers 0 and 12 for the non-relevant document
These plots, in contrast, show the differences between the first and the last layer, that is, how much the embeddings changed between them. Colors are inverted for better visualization, so warmer colors correspond to greater changes. The plots confirm that the embeddings changed substantially between these two layers. For the relevant document, the tokens most correlated with the theme of the trial and the query are the ones whose embeddings change the most. This happens with the tokens “students”, “young”, and “health” (the document and query are related to stress in young people). For the non-relevant document, there is also an adaptation to the chosen corpus, but it is not related to the query (the embedding of “sleep”, for example, does not change much).
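One way to quantify how much each token changed between the first and last layer is sketched below. It assumes the change is measured as 1 minus the cosine similarity between a token's initial and final embedding; the vectors are toy values chosen by hand, not the project's embeddings, and the token names are placeholders.

```python
import numpy as np


def embedding_change(initial, final):
    """Per-token change, defined here as 1 - cosine similarity (an assumption)."""
    dot = np.sum(initial * final, axis=1)
    norms = np.linalg.norm(initial, axis=1) * np.linalg.norm(final, axis=1)
    return 1.0 - dot / norms


# Toy 2D vectors, for illustration only.
tokens = ["token_a", "token_b", "token_c"]
initial = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
final = np.array([[0.0, 1.0], [1.0, 1.0], [1.0, 0.9]])

change = embedding_change(initial, final)
# Tokens ordered from the largest to the smallest change.
ranking = [tokens[i] for i in np.argsort(-change)]
```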
Self-attention head visualization: This view shows relationships between tokens by head and by layer. Almost all of the examples we chose come from the corpus containing the relevant document, as these presented the most interesting patterns. In addition, the non-relevant document showed fewer (and weaker) links between words, in particular to the [CLS] token. Below are some examples of the different aspects learned:
Figure 8. Pattern 1
Figure 9. Pattern 2
Figure 10. Pattern 3
Figure 11. Pattern 4
Figure 12. Pattern 4
Figure 13. Pattern 5
Figure 14. Pattern 1/2
Figure 15. Pattern 4
Figure 8 – This head links each word to the word immediately following it (the first words chosen from this set form part of a sentence that also appears in the query).
Figure 9 – This example is similar to the previous one, but shows attention to the previous word.
Figure 10 – This head identifies words and expressions that are related to each other (in the same sentence), even though they are not adjacent in the text. The sentence is: "Depression, anxiety and stress are among the primary causes of disease rates worldwide and are the most prevalent mental health problems in the U.S". As we can see, the words “mental”, “health” and “problems” refer to “Depression”, “anxiety” and “stress”, and this is the type of relationship that the head captures.
Figure 11 – This example captures relationships between words in different sentences. In this case, it shows a morphological relationship between the words “student” and “students”.
Figure 12 – This head shows the same pattern as the previous figure, but also captures semantics, linking words like “Young” and “22”, or “Young” and “student”.
Figure 13 – This head captures the possible prediction of words, shown here by the word “mental” predicting the word “problems”, with the two separated by the word “health”.
Figure 14 – This image shows a head that identifies two distinct patterns (already seen in Figures 8 and 9), associating a word with either the word that precedes it or the one that follows it.
Figure 15 – This image shows a curious relationship between a verb in the infinitive and its gerund, linking “sleep” to “sleeping”. This case comes from the non-relevant document, although this has no influence on how the relationship between words in the document is established.
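Positional patterns such as the next-word and previous-word heads can also be checked numerically rather than by eye. The sketch below is one possible approach, not the method used in this project: given the attention matrix of a single head (rows are the attending tokens), the hypothetical helper `offset_attention_mass` averages the weight placed on the token a fixed number of positions away. The attention matrix is a hand-built toy example.

```python
import numpy as np


def offset_attention_mass(attention, offset):
    """Average attention weight placed on the token `offset` positions away."""
    n = attention.shape[0]
    return float(np.trace(attention, offset=offset)) / (n - abs(offset))


# Toy head over 5 tokens that mostly attends to the next token.
n = 5
attention = np.full((n, n), 0.05)
for i in range(n - 1):
    attention[i, i + 1] = 0.8
attention /= attention.sum(axis=1, keepdims=True)  # each row sums to 1

next_score = offset_attention_mass(attention, offset=1)
prev_score = offset_attention_mass(attention, offset=-1)
print(next_score, prev_score)
```

A head with a high score for offset 1 behaves like the one in Figure 8, while a high score for offset -1 corresponds to the behavior in Figure 9.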
5. Conclusion
This project gave us the opportunity to put into practice the theoretical knowledge studied in class. Facing the real difficulties of programming, using, and analyzing the results of information retrieval algorithms taught us more about them and showed their usefulness in real problems, such as matching patient cases to clinical trials. It was also very interesting to follow the evolution of the technology, from simple models based on the frequency of words/tokens in the corpus to models that better capture semantic similarities and the correlation between tokens given the context in which they occur.
This phase concerns the latter models, which turn out to be more robust and provide a better analysis of the text. The word2vec and BERT models are pre-trained on a large set of documents, which makes them very good at capturing the semantic similarity between words. In addition, BERT allows more tailored training, using multiple layers and heads to learn various correlations between words, based on the context in which they occur in the corpus provided.
With that said, we believe that analyzing the data has helped us understand how these algorithms work, what they are capable of doing, and how we could use them on a personal or professional level to improve an information retrieval task.