Originally published at: https://developer.nvidia.com/blog/nvidia-digits-alzheimers-disease-prediction/
Pattern recognition and classification in medical image analysis have been of interest to scientists for many years. Machine learning techniques have enabled researchers to develop and utilize complicated models to classify or predict various abnormalities or diseases. Recently, successful applications of state-of-the-art deep learning architectures have rapidly expanded in medical imaging. Cutting-edge deep learning…
Why, after showing me a beautiful ROC curve, do you state the accuracy result in the results section rather than the ROC AUC value? The whole reason for ROC AUC is to avoid the pitfalls of using accuracy as your evaluation metric, i.e. high accuracies can be achieved concurrently with very low sensitivities or specificities in rare or common conditions, respectively. I'm left wondering what the actual AUC value is. If you need help with medical/scientific writing from someone who also knows machine learning, feel free to contact me. Chip Reuben, MS
Thank you for this article! It seems that axial slices at different locations/times from the *same* subject were used in both the training and testing sets, meaning that these sets are not separated at the subject level. Am I understanding this correctly?
Thanks for your interest in this work, and thank you for your attention to detail. In this work, we performed the classification at the slice level. As you understood, we created samples from all subjects' fMRI time series, then shuffled the data and created training and testing samples. The reported accuracy is for slice-level classification. One more thing I'd like to share: slices from a given subject are treated as independent samples even though they are highly correlated, which means our training and testing datasets are completely independent from each other only at the "slice level". I realize that medical imaging researchers are more interested in "subject-level" classification, which is why I continued the project.
In our more complete paper, DeepAD, we performed "subject-level" classification: we divided the subjects into two groups, training and testing, and then carried out the classification. Again, we achieved a very high accuracy rate. We also designed a decision-making algorithm to stabilize the prediction process when deciding whether a subject has Alzheimer's or not.
The beauty of the CNN architecture is that it produces a well-generalized model once it has been successfully trained and validated on a high volume of data. Please feel free to have a look at DeepAD at http://biorxiv.org/content/..., where we used a huge dataset to classify slice-level and subject-level structural and functional MRI data.
Hope it helps,
Thanks for your interest in this paper.
There are different views on whether to use ROC/AUC as the classification metric. I personally tend to stick with accuracy as the performance metric since it makes more sense in our research field. The reason I generated the ROC curve was to ensure that the classification had been successfully performed. In addition, ROC/AUC is more informative for imbalanced data, which is not the case in our work.
Thanks for your comment and offer. I will keep your contact information.
Thank you for the elaborate response and the link to the DeepAD paper. After reading it, I'm still not clear how the subject-level classification was done and I would appreciate some clarification since the accuracy you report is pretty remarkable.
In section 6 of that paper you mention that "the adopted LeNet model and GoogleNet were adjusted..." - by "adjusted" do you mean "fine-tuned"? I.e. have you used the networks you previously trained for the slice-level classification and fine-tuned the last layers on the subject-level task? If so, I'm assuming that none of the slices from the subjects chosen for the subject-level classification were seen by the networks during the training stage of the slice-level classification. This would leave very little data to fine-tune with as you only have 52AD/92NC for rs-fMRI and 211AD/91NC for MRI.
Can you please provide more details as to what layers you fine-tuned and how much data was used for this purpose?
1 - We trained the networks from scratch; no fine-tuning was performed, as clearly mentioned in the paper. The initial LeNet and GoogleNet architectures were designed for a different number of classes, but I used them for binary classification, so some adjustments were required.
2 - If you take a look at the pipeline and data conversion section of the paper, I explain how I extracted 2D slices from the data to generate a huge dataset for both the fMRI and MRI pipelines.
As I said in my previous comment, for "subject-level" classification the subjects were divided into training and testing groups first, and the 2D slice samples were generated afterwards. This means no slice from the same subject appeared in both the training and testing datasets; in other words, the training and testing data had NO subjects in common.
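For readers trying to replicate this, here is a rough sketch of the idea of splitting by subject before generating slices (the function and variable names are my own illustration, not code from the paper):

```python
import random

def subject_level_split(subject_ids, test_fraction=0.3, seed=0):
    """Split SUBJECTS (not slices) into training and testing groups,
    so that no slice from a test subject can ever appear in training."""
    rng = random.Random(seed)
    ids = sorted(subject_ids)
    rng.shuffle(ids)
    n_test = max(1, int(len(ids) * test_fraction))
    return set(ids[n_test:]), set(ids[:n_test])

# 2D slices are generated only AFTER this split, keyed by subject id,
# so the two slice sets share no subjects.
train_ids, test_ids = subject_level_split(range(10))
```

The key point is simply that the shuffle happens over subject identifiers, not over individual slices.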
Regarding the reported accuracy: some research groups using different strategies have achieved very high accuracy rates, and I mention them in the literature review; please look at the comparison table. However, I was able to improve the accuracy rate for MRI data through much more accurate preprocessing and some tricks in DL. In addition, fMRI was used for this classification for the first time, and thanks to a very accurate and massive preprocessing pipeline and certain optimizations, I achieved the highest accuracy rate reported so far.
Hope it helps. If it is still unclear or you need more clarification, I will be more than happy to help you or anybody else replicate the DeepAD paper and achieve the same accuracy, as long as you use exactly the same methods I used in the paper.
You can reach me at firstname.lastname@example.org.
Thanks for your quick response, Saman! For the subject level, I understand that you do the subject-level separation prior to generating the 2D slices for classification, but since your test set includes multiple slices from the same subject, how do you calculate subject-level accuracy? Do you average accuracy across all slices of the same test subject? Or is it still slice-level accuracy?
That’s my pleasure to help.
In the subject-level experiments, the reported accuracy is still based on slice classification. How could we report an accuracy rate for a single subject? It would not make sense. What you can do instead is measure the probability of a subject being AD or NC. Does that make sense?
What I developed was a decision-making algorithm that counts the number of slices classified as AD or NC, calculates the probability, and votes with the majority. Let's say (the number is just an example) that for a given subject with 1000 slices, 900 slices were recognized as AD; the probability of AD is then 90%, and the decision maker votes for AD.
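A minimal sketch of that majority-vote rule (my own illustrative names, not the paper's code):

```python
def subject_decision(slice_predictions, positive_label="AD", other_label="NC"):
    """Majority vote over all slice-level predictions for one subject.
    Returns the winning class and the fraction of slices that voted for it."""
    n_positive = sum(1 for p in slice_predictions if p == positive_label)
    prob = n_positive / len(slice_predictions)
    if prob >= 0.5:
        return positive_label, prob
    return other_label, 1 - prob

# The 900-of-1000 example from above:
label, confidence = subject_decision(["AD"] * 900 + ["NC"] * 100)
# label == "AD", confidence == 0.9
```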
In DeepAD, Table 5 and Figure 15 summarize what I explained above.
Feel free to post your comments or reach out to me if you have more questions.
That is a very nice paper and a useful report for every user, including me as a beginner in deep learning. My question: is the 97% accuracy the best accuracy you got from your data?
This is the accuracy averaged over five shufflings of the data in this conference paper.
As I showed in DeepAD, by updating the preprocessing pipeline and adding more training samples, I was able to achieve up to 99.9% for slice-level recognition.
Hi. I really admire your work and I am trying to replicate it. I have a question about Table 1: what exactly is a "volume"? I was treating a volume as one .nii image, but now I am totally confused about how you get such a huge total number of images.
Is there any other good tool to replace FSL-VBM for extracting GM? I have tried it and it takes a lot of time.
Dear Ammarah, Thanks for your interest in this paper and the expanded version DeepAD.
Let me answer both of your questions in this reply. I think there is a misunderstanding here: what we used in this paper was functional MRI data, which is 4D (3D × time). "Volume" in Table 1 means that the 3D volumes of a given subject were collected 300 times (time points).
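To see why the total image count gets so large, here is a back-of-the-envelope sketch (the slice and subject counts below are hypothetical examples, not numbers from the paper; only the 300 time points come from Table 1):

```python
# Hypothetical dimensions for illustration:
slices_per_volume = 45   # axial slices in one 3D volume (assumed)
time_points = 300        # 3D volumes collected per subject (Table 1)
n_subjects = 100         # assumed cohort size

# Each time point is a full 3D volume, and every volume contributes
# its axial slices as separate 2D samples, so the totals grow fast:
samples_per_subject = slices_per_volume * time_points
total_samples = samples_per_subject * n_subjects
print(samples_per_subject, total_samples)  # 13500 1350000
```

So a single .nii file of rs-fMRI data already contains hundreds of 3D volumes, which is where the huge slice counts come from.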
Regarding your second question, FSL-VBM is actually a tool for structural MRI, not functional MRI. You can also use SPM8 to process your structural MRI data.
Please also clarify how to select slices for sMRI. I get about 256 slices per subject/NIfTI image, then I discard the slices at the start and end that are just black and contain no information. That leaves about 70 to 100 useful slices, but they cover different brain portions, from very small top slices to good-looking axial slices. Am I doing it correctly? Also, have you used data augmentation for sMRI?
Good job. Thanks for sharing your experiences.
Did you write the ROC curve code in DIGITS? How can I get the ROC curve and confusion matrix in DIGITS? Can I add any code?
Hi there, thanks for your interest in this work and paper.
Actually, I generated the ROC curves outside of DIGITS. First, you need to use the Classify Many option in DIGITS to get the predicted labels and scores for your testing samples. Next, save the results as HTML files (or any format you are more comfortable with) and write your own code to draw the ROCs. I did it in MATLAB.
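As a sketch of that post-processing step in pure Python (my own illustration; the author used MATLAB, and the names below are not from DIGITS): once you have exported per-sample labels and scores from Classify Many, the ROC points and AUC can be computed like this:

```python
def roc_points(labels, scores):
    """Compute (FPR, TPR) points by sweeping a threshold over the scores.
    labels: 1 for the positive class (e.g. AD), 0 for the negative (NC).
    Ties in score are handled naively here."""
    pairs = sorted(zip(scores, labels), reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _score, label in pairs:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# A perfectly separated toy example gives AUC = 1.0:
pts = roc_points([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
```

If you prefer not to roll your own, scikit-learn's `roc_curve` and `roc_auc_score` do the same job.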
thanks a million
Dear Prof. Sarraf,
Thanks for your work. This may be a very stupid question. In your original paper, you just split the data into training and testing datasets, and you gave loss and accuracy figures over the 30 epochs for both training and testing. From what I understand, your "testing data" in the original paper is actually used to validate the model, not as a real testing dataset. How did you calculate the accuracy on your testing data?
In the post here, you split the data into three datasets: training, validation, and testing. You mentioned, 'We repeated the entire dataset generation and classification process five times for 5-fold cross validation, achieving an average accuracy rate of 96.85%.' So this accuracy is based on a real testing dataset, unlike in the original paper, no?
Look forward to your response
The accuracy rates reported in this tutorial were extracted from the original paper. Whether you use a testing dataset or separate testing/validation datasets, as long as the model is evaluated against unseen data, the evaluation is completely valid.
Hope it helps.