As the number of available imaging AI algorithms grows each month, the ability to truly validate a model’s performance and use that validation to enhance its clinical and operational performance has arguably become more important than the study-based accuracy claims that had everyone so impressed just a few years ago.
You could say that we’re at the “prove it and improve it” phase of the imaging AI adoption curve, which is what makes Qure.ai’s recent algorithm validation partnership with MEDNAX and vRad so interesting – and so important.
In this Imaging Wire Q&A, we sat down with Chiranjiv Singh, Qure.ai’s Chief Commercial Officer; Brian Baker, vRad’s Director of Software Engineering; and Imad Nijim, MEDNAX Radiology and vRad’s CIO, to discuss the origins and results of their efforts to validate Qure.ai’s qER solution “in the wild.” Here it is:
The Imaging Wire: How did Qure.ai and MEDNAX come to work together?
Brian Baker: To explain how the Qure.ai and MEDNAX partnership came about, a quick history of the MEDNAX AI incubator is important. MEDNAX has been working with AI partners since 2015 in various forms with the primary goal of improving patient care. Qure.ai was one of the earlier partners in that process. Before the incubator was officially launched in 2018, Qure.ai was already collaborating on advanced solutions.
One important thing we bring to these AI partnerships is our massive and diverse data. MEDNAX Radiology Solutions has 2,000-plus facilities in all 50 states. We have radiologists all across the country reading over 7.2 million studies on the MEDNAX Imaging Platform. We have an enormous, heterogeneous data set. The data is not only representative of a very diverse population, but also a very diverse set of modality models, configurations, and protocols.
My primary focus for AI at MEDNAX Radiology Solutions is first and foremost patient care – helping patients is our number one goal. But also important, we want to foster a community of AI partners and use models from those partners in the real world. A big part of that is building models and/or validating models.
Qure.ai came to us with models already built on different data sets. They didn’t need our data set to perform additional model training, but they wanted to do real world validations to ensure their models and solutions were generalizing well in the field against an extremely large and diverse cohort of patients.
That is where the relationship blossomed. Our partnership first focused on the complex aspects of how we see different use cases from a clinical standpoint; we very much align on both use cases and pathologies; this alignment is a critical step for everyone – AI vendors and AI users in radiology alike. The clinical nuances to using a model in production are incredibly intricate, and Qure.ai and MEDNAX’s convergence in this area is a large part of our success.
Chiranjiv Singh: From our inception as a company, there was a clear emphasis that Qure.ai as a brand has to stand for proving the applicability of our product in a real-world context. And, for us to make a significant impact for our customers and their patients, the results have to be highly measurable. This implies that our solutions need to be extensively tested and credible at every level. Achieving this degree of validation requires a high volume and variety of independent data sets, and it also required us to expose our algorithm to rigorous test conditions.
That is where our strategic goals aligned with MEDNAX’s goals – and, together, with the MEDNAX team, we started calling this validation exercise “testing in the wild.” The Qure.ai team saw the value of partnering with someone of MEDNAX’s size and caliber to drive the variety, volume and rigor to help us validate every aspect of our solution. Without us leveraging the scale and the volumes of MEDNAX, we would never have been able to achieve it in that short period of time unless we worked with roughly 100 different hospitals in the U.S.
What made the partnership stronger was the caliber of the MEDNAX team and the overall platform that they provided for us to jointly learn and improve. And, for these reasons, a very strategic alignment came about for both our teams, jointly working to make this “validation in the wild” a successful project for us both.
Brian: I believe only half the problem is proving your sensitivity and specificity with a large, diverse patient cohort. That is obviously extremely important for clinical and ethical reasons, but the other part of the problem to solve is figuring out how to ensure that a solution or model works on all the various types of DICOM in the industry. At MEDNAX Radiology Solutions, we see everything in DICOM that you can imagine and some you would not believe. That might be anything from slightly weirdly-formed DICOM to data in non-standard fields where it shouldn’t be or secondary captures or other images inside of the study, down to all the protocols involved in imaging (how the scan is actually acquired). With our scale and diversity of data, a model that can operate without erroring and crashing through a single night is an engineering feat on its own.
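To make the DICOM point concrete, here is a minimal Python sketch of the kind of defensive pre-screening a model wrapper might do before inference. It is an illustration only, not MEDNAX's or Qure.ai's code. The `screen_image` function and the plain-dict metadata are hypothetical stand-ins for a real DICOM parser, and restricting the check to CT is an assumption. The only value taken from the DICOM standard is the Secondary Capture Image Storage SOP Class UID.

```python
# Hypothetical pre-screen: decide whether one image is safe to send to a CT model.
# The metadata dict stands in for whatever a real DICOM parser would return.

SECONDARY_CAPTURE_UID = "1.2.840.10008.5.1.4.1.1.7"  # Secondary Capture Image Storage


def screen_image(meta):
    """Return (usable, reason) for one image's metadata; never raises on bad input."""
    if meta.get("SOPClassUID") == SECONDARY_CAPTURE_UID:
        return False, "secondary capture"

    # Tolerate stray whitespace and casing in the modality field.
    if str(meta.get("Modality", "")).strip().upper() != "CT":
        return False, "not CT"

    # Slice thickness may be missing, padded, or contain non-numeric text.
    try:
        thickness = float(str(meta.get("SliceThickness")).strip())
    except ValueError:
        return False, "unreadable slice thickness"
    if not thickness > 0:  # also rejects NaN
        return False, "invalid slice thickness"

    return True, "ok"


examples = [
    {"SOPClassUID": SECONDARY_CAPTURE_UID, "Modality": "CT"},
    {"SOPClassUID": "1.2.840.10008.5.1.4.1.1.2", "Modality": " ct ", "SliceThickness": "5.0 "},
    {"Modality": "CT", "SliceThickness": "thin"},
]
screened = [screen_image(meta) for meta in examples]
```

The design choice worth noting is that every branch returns a reason instead of raising, so one odd image is logged and skipped and does not crash an overnight run.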
The Imaging Wire: Brian, can you share about the test, the results, and takeaways?
Brian: We’ve taken Qure.ai’s container-based solution that includes the AI models and plugged it into MEDNAX Radiology Solutions’ own inference engine. In our inference engine, image studies flow to models/solutions that are in a validation run in nearly the same way that the models/solutions will run if they successfully pass our validation. The major difference is that during validation, the results of the models do not initiate any action in the MEDNAX Imaging Platform – instead we just gather the data.
As imaging studies flow through the inference engine, we capture the results along with the results of Natural Language Processing (NLP) models run against our clinical reports (from radiologists). This allows us to very quickly determine how a model is doing at MEDNAX scale. We compare the NLP results to the Image AI results and have a very good understanding of how the model is performing within the MEDNAX Imaging Platform.
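The comparison described above can be sketched in a few lines of Python. This is an illustrative example, not the MEDNAX implementation: the `StudyResult` record and its field names are hypothetical, and it assumes the NLP label derived from the radiologist's report is treated as the reference standard.

```python
from dataclasses import dataclass


@dataclass
class StudyResult:
    study_id: str
    ai_positive: bool   # image AI model flagged the finding
    nlp_positive: bool  # NLP found the finding in the radiologist's report (reference)


def sensitivity_specificity(results):
    """Return (sensitivity, specificity), treating the NLP label as the reference."""
    tp = sum(r.ai_positive and r.nlp_positive for r in results)
    fn = sum((not r.ai_positive) and r.nlp_positive for r in results)
    tn = sum((not r.ai_positive) and (not r.nlp_positive) for r in results)
    fp = sum(r.ai_positive and (not r.nlp_positive) for r in results)
    # A metric is undefined (None) when its denominator is empty.
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity


# Made-up sample data, for illustration only.
sample = [
    StudyResult("s1", True, True),
    StudyResult("s2", False, True),
    StudyResult("s3", False, False),
    StudyResult("s4", True, False),
    StudyResult("s5", False, False),
    StudyResult("s6", True, True),
]
sens, spec = sensitivity_specificity(sample)
```

Because the reference here is itself a model output, disagreements can come from either side, which is one reason a hand-reviewed subset is a useful complement.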
My team monitors all models on a continuous basis. For models being validated, this data is what makes up the core basis of our validation process. For models that have already been validated, this continuous monitoring ensures that models remain within approved thresholds – if a model successfully goes through our validation process and is approved by clinical leadership, it is important that the model continues to operate with the same sensitivity and specificity. If for any reason the data changes (patient demographic makeup, image acquisition changes, etc.) and the model no longer performs to our standards, we are alerted and remove that model from usage on the clinical platform.
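One way to picture this kind of continuous monitoring is a rolling window of recent results checked against approved thresholds. The sketch below is an assumption-laden illustration, not vRad's monitoring system: the `ModelMonitor` class, the window size, and the threshold values are all hypothetical.

```python
from collections import deque


class ModelMonitor:
    """Track recent (AI result, reference result) pairs against approved thresholds."""

    def __init__(self, min_sensitivity, min_specificity, window=1000):
        self.min_sensitivity = min_sensitivity
        self.min_specificity = min_specificity
        self.recent = deque(maxlen=window)  # oldest results drop off automatically

    def record(self, ai_positive, reference_positive):
        self.recent.append((ai_positive, reference_positive))

    def metrics(self):
        tp = sum(1 for a, r in self.recent if a and r)
        fn = sum(1 for a, r in self.recent if not a and r)
        tn = sum(1 for a, r in self.recent if not a and not r)
        fp = sum(1 for a, r in self.recent if a and not r)
        sens = tp / (tp + fn) if tp + fn else None
        spec = tn / (tn + fp) if tn + fp else None
        return sens, spec

    def within_thresholds(self):
        sens, spec = self.metrics()
        if sens is None or spec is None:
            return True  # assumption: too little data to judge, so do not alert
        return sens >= self.min_sensitivity and spec >= self.min_specificity


monitor = ModelMonitor(min_sensitivity=0.9, min_specificity=0.9, window=4)
for ai, ref in [(True, True), (False, False), (True, True), (False, False)]:
    monitor.record(ai, ref)
healthy = monitor.within_thresholds()

# Simulate drift: the model starts missing positive studies.
monitor.record(False, True)
monitor.record(False, True)
still_ok = monitor.within_thresholds()
```

In this toy run, `healthy` is true while the model agrees with the reference, and `still_ok` becomes false once missed positives pull windowed sensitivity below the threshold. That is the point at which an alert would fire.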
For a validation run, we typically run a model for two weeks, and then capture those two weeks of data for further evaluation. The Qure.ai model has been running for several months to make sure it is hardened and successful. There were 300,000 studies that passed through when we looked in October. While the validation set is only two weeks of data, Qure.ai’s model held a consistent sensitivity and specificity throughout the process of integration.
For the validation evaluation, we built a validation document for Qure.ai that explores not only sensitivity and specificity against various NLP definitions, but also smaller hand-reviewed sub-cohorts as well as added analysis focused on sex and age breakdowns.
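A sex and age breakdown of this kind amounts to computing the same metrics per sub-cohort. The following Python sketch shows one way to do that. The record layout, the `age_band` cut-offs, and the sample data are assumptions made for illustration, not the contents of the actual validation document.

```python
from collections import defaultdict


def age_band(age):
    """Hypothetical age bands; the real cut-offs are not given in the interview."""
    if age < 18:
        return "0-17"
    if age < 65:
        return "18-64"
    return "65+"


def breakdown(records, key):
    """Sensitivity and specificity per group, where key(record) names the group."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "tn": 0, "fp": 0})
    for rec in records:
        if rec["reference"]:
            cell = "tp" if rec["ai"] else "fn"
        else:
            cell = "fp" if rec["ai"] else "tn"
        counts[key(rec)][cell] += 1

    summary = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["tn"] + c["fp"]
        summary[group] = {
            "n": positives + negatives,
            "sensitivity": c["tp"] / positives if positives else None,
            "specificity": c["tn"] / negatives if negatives else None,
        }
    return summary


# Made-up records, for illustration only.
records = [
    {"sex": "F", "age": 70, "ai": True, "reference": True},
    {"sex": "F", "age": 34, "ai": False, "reference": False},
    {"sex": "M", "age": 81, "ai": False, "reference": True},
    {"sex": "M", "age": 45, "ai": True, "reference": False},
    {"sex": "M", "age": 12, "ai": False, "reference": False},
]
by_sex = breakdown(records, key=lambda rec: rec["sex"])
by_age = breakdown(records, key=lambda rec: age_band(rec["age"]))
```

Reporting `n` alongside each metric matters because small sub-cohorts can produce extreme or undefined values, as the `None` entries in this toy data show.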
The Imaging Wire: What were some of the key takeaways for Qure.ai in terms of validation and learning about how your models performed “in the wild?”
Chiranjiv: We learned a great deal as a result of going through this process. A lot of work went into the back-end R&D process – in terms of relooking at our data science models and engineering analysis and really pinpointing where the weak points are and where the model can potentially break down. Our team was able to use the feedback and look at real clinical cases to fix these shortcomings and test them again with constant feedback models coming in through MEDNAX. This has made our solution more accurate, our predictive analytics sharper and our engineering ability far stronger than when we started out. Having the ability to go through the exercise of assessing 300,000 exams in a performance evaluation is a powerful proving ground. We are confidently sharing this with our customers by pointing out the fact that “the accuracy or performance of a model is only one part of fulfilling the promise of making AI real.”
The way the MEDNAX Imaging Platform is set, it’s like getting nearly real time, or live feedback on potential areas of error, improving the model and seeing your false positives and false negatives reduce with every round of testing. We learned so much looking at the variety of data, different kinds of DICOMs, incorrect DICOM tags, diverse acquisition protocols, every possible CT manufacturer, varying slice thicknesses, etc. Even though we had a lot of that before this partnership, this experience gave us an opportunity to bring stronger products to the market.
The next step for us is to share this with our potential customers and leverage this partnership to further spread the word that “making AI real” is not just about algorithm accuracy. Yes, accuracy is a critical piece – but if, for example, you’re not beating speed requirements (like those vRad and MEDNAX Radiology Solutions had), then there is no point in taking 10 minutes to read a CT when the entire turnaround is less than 10 minutes.
As a result of this partnership, we have made significant strides in our journey from innovative data models to working AI products. The Qure.ai team now has the ability and the confidence that, if any large client wants to deploy “AI in the real world,” we have the expertise and experience in handling the kind of volume and variety that we would have never experienced without working with vRad and MEDNAX Radiology Solutions.
The Imaging Wire: Many in the AI research community highlight a need for multi-center prospective studies. What role do you think this type of partnership can play in the absence of these studies or as a contributor to these studies?
Brian: I view MEDNAX Radiology Solutions’ role in the AI community as a mandate to help companies such as Qure.ai run large multi-center validations. Often, the community at large views this type of validation as important due to the diverse population of patients. And while I agree that is incredibly important, it is worth noting that it is also important to validate against various DICOM implementations and image study acquisition parameters.
Imad Nijim: There is obviously a lot of research going into this and the academics are very active with this work. For us, a big focus is on the real-life implications of this, and there was really hard work on both sides. One of the first steps was defining intracranial hemorrhage, and MEDNAX and Qure.ai had different definitions that they had to reconcile. They had to dig into the minutiae of their definitions and their results went into the AI model and imaging model that they built together.
Chiranjiv: This was not a validation study with one institution that has a standard protocol, defined patient profile, limited device inputs, etc. This is the fastest and closest you get to a multi-center study as the exams are coming from hundreds of different medical facilities across the country. MEDNAX gave us the ability to validate the algorithm with a diverse data set, different user settings, equipment types and all the other variability that a multi-center study would offer.
The Imaging Wire: Do you have any final thoughts on this partnership?
Chiranjiv: During this experience there was clear alignment on identifying the end value. We both realized that this project is not just about improving accuracy. If this is done well, it will influence decisions that directly impact patient lives. Most of the clinical cases involved CT scans being read as part of night services for medical facilities across the U.S. Many of these facilities, especially the smaller community-based hospitals, may not have experts to read these exams, especially during late-night hours. Our team had the context that if we do all this hard work to get the engineering, accuracy, and clinical definitions right, it positively impacts the patient. We can be the catalyst that makes the difference for that one patient. That has to be the north star. And this vision was what aligned Qure.ai and MEDNAX in the first place – and it’s what drove us to really get this right.
Imad: People that focus on the technology aspect of AI will get tripped up. The questions that people need to ask are: What problem are they solving? What workflow are they optimizing? What condition are they trying to create a positive outcome for? These are the questions that we need to ask and then back into it with the technology component. It sounds simple, but a lot of people don’t understand that and it’s a big differentiator between the successful and unsuccessful companies.