Power BI & R in a Public Health Lab: Automating Lab Data Analysis
The problem: data drudgery.
Our solution: business intelligence tools.
Two business intelligence tools are available for use at the Napa-Solano-Yolo-Marin Public Health Laboratory (NSYM PHL): Power BI and R. I described how the laboratory staff use these tools during my presentation to LabAspire Fellows and their mentors. I'm sharing part of it here.
Context
The NSYM PHL processes over 6,000 specimens annually, with more than 75% being environmental specimens (i.e., water samples and ticks). The rest are clinical specimens. Once these specimens are in the laboratory, they undergo three phases of testing:
Pre-analytical phase
Information about the specimen is derived from the requisition form that comes with it: patient information (e.g., unique identifiers, travel history), submitter information (often a medical facility), and sample information (e.g., sample source and type, collection date and time, type of test requested). This information is then entered into the laboratory information management system (LIMS), and the specimen's suitability for testing is assessed. Some questions asked at this stage are:
- Is the primary container intact? Is the specimen leaking from the biohazard bag containing the primary container?
- Did the specimen arrive within the transport temperature and conditions required for the requested test?
- Was the specimen stored at the required or recommended storage temperature and conditions for the requested test?
Analytical phase
This is when Public Health Microbiologists (PHMs) and laboratory technicians generate the raw data.
PHMs observe the characteristics of microorganisms directly from a clinical specimen, often using microscopy. In most cases, they place the specimen onto culture media, which allow target microorganisms to grow. For example, suppose a physician wants to rule out tuberculosis in a patient with symptoms consistent with the disease (e.g., weight loss, night sweats, fatigue, coughing up bloody sputum). In that case, the physician may send the patient's sputum samples to the laboratory. PHMs then prepare the specimen by subjecting it to a mucolytic-decontaminant mixture to minimise the growth of bacteria typically found in the respiratory tract. Next, they place the processed specimen in three culture media, incubate them for the recommended time, and monitor the media for bacterial growth. Any bacterial colony is then examined microscopically for characteristics consistent with the target microorganism (e.g., acid-fastness, rod shape). Bacteria with these characteristics are grown as isolates for identification by Matrix-Assisted Laser Desorption/Ionisation Time-of-Flight Mass Spectrometry (MALDI-ToF-MS).
Laboratory technicians, on the other hand, add nutrients for target bacteria (e.g., coliforms, Escherichia coli) to water specimens. By observing the colour of the specimen after incubation, they determine the presence of bacteria indicating that the source of a water specimen is unsafe to drink (e.g., a water well) or to swim in (e.g., a beach). They may also estimate the amount of bacteria in the water specimen.
Laboratory technicians and PHMs record their observations in worksheets and in the LIMS.
Post-analytical phase
Reports are generated in the LIMS and transmitted to the facility that submitted the specimen.
This phase is also when the laboratory summarises the results. The summaries are reported regularly to a state laboratory program (e.g., the Healthcare-Associated Infections Program of the California Department of Public Health) or a county public health group. Data is also presented to the public health officers of the counties served by the laboratory. Sometimes, statistical analyses are required to provide data-informed insights about laboratory procedures.
Before I discovered that the laboratory has business intelligence tools, I observed staff toiling for days, manually copying information from the LIMS to Excel spreadsheets or diligently leafing through printouts of results already entered in the LIMS.
This irritated me: the workflow was slow, tedious, and prone to errors. In data science class, my instructors made sure we learned to minimise repetitive work by automating at least part of the process.
When I learned that Power BI was available county-wide, I built a few dashboards that could be easily updated as new data became available. Because the dashboards are simple to use (i.e., all one has to do is hit the "Refresh" button), it's easier for staff to learn them. R is also available, but not as widely as Power BI. Also, it's more challenging to teach staff with no coding background to use R, so its use is quite limited (i.e., I'm the only one using it for now). In other words, I use both tools; staff currently use just Power BI for report generation. At some point, I will start teaching them how to create dashboards so they can explore the data generated more freely.
Workflow
The general workflow for data analysis and visualisation, regardless of the business intelligence tool I am using, is:
- Data extraction from LIMS to a .csv file
- Data cleaning (especially removal of personally identifiable information and protected health information)
- Connecting or opening the data in the business intelligence tool
- Performing analysis, report generation, and/or data visualisation
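To make this concrete, here is a minimal sketch of steps 2 to 4 in R. Everything here is illustrative: the file name and column names (patient_name, patient_id, specimen_type, and so on) are hypothetical stand-ins for whatever the LIMS export actually contains.

```r
library(readr)
library(dplyr)

# Step 1 output: the .csv extracted from the LIMS
specimens <- read_csv("lims_extract.csv")

# Step 2: drop PII/PHI columns and exclude "test patients"
# (all column names here are hypothetical)
clean <- specimens %>%
  select(-patient_name, -date_of_birth, -address) %>%
  filter(!grepl("TEST", patient_id, ignore.case = TRUE))

# Steps 3-4: open the data in the tool and summarise, e.g.,
# specimen counts by type and month
clean %>%
  mutate(month = format(as.Date(collection_date), "%Y-%m")) %>%
  count(specimen_type, month)
```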
Advantages of using business intelligence tools
Summarising "large" data sets
By "large", I mean that the data sets often have over 1,000 rows, and staff used to enter data manually in Excel. Effectively, they entered data twice because they didn't know that exporting the data they had entered in the LIMS was possible.
The learning curve was not too steep for the staff, particularly with the dashboards I had built in Power BI. The critical tasks were:
- Determining which variables had to be included in the data extraction step (I created a data browser/query in the LIMS for this)
- Cleaning the data to exclude proficiency test results (I incorporated a filter for this in the data browser/query)
- Cleaning the data to "test patients" (currently doing this manually on the extracted data)
- Reviewing the outputs on Power BI to determine if the summaries (e.g., counts, measures of central tendencies) make sense
These tasks usually take less than an hour because most of the work is handled by the LIMS and Power BI. Sometimes, generating summaries is more involved due to the conditions required. For instance, a one-off request for peak hours of an activity (e.g., sample accessioning, instrument use) would require me to include the time of data entry for that variable for each sample. In this case, the requested variables need to be added to the LIMS data query, and the Power BI code should be updated to include them as new columns. This is still manageable because it's just a matter of clicking buttons and filling in the blanks.
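For the peak-hours example, Power BI handles the new column through its query editor; purely for illustration, the equivalent summary in R (continuing from a cleaned extract like the one sketched earlier, with a hypothetical accessioned_at timestamp column) might look like:

```r
library(dplyr)
library(lubridate)

# Count accessioning events per hour of day; the busiest hours
# float to the top (accessioned_at is a hypothetical column)
clean %>%
  mutate(hour = hour(ymd_hms(accessioned_at))) %>%
  count(hour, sort = TRUE)
```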
R is different because I have to write the code myself. I default to R when the task is more complex and the turnaround time is short, since I have used R longer than Power BI.
Making data-informed decisions
Once the summaries are reviewed, the thinking begins. What story is the data telling us?
For example, one of the laboratory's medical submitters wanted to dig deeper into multiple instances of blood specimens drawn from the same patient yielding differing results (i.e., conversions from positive to negative or vice versa), which raised concerns about the inconvenience to patients of redrawing blood. The public health group examining the data also wanted to investigate the risk factors that lead to patients being subjected to blood redraws. The laboratory's solution was to conduct a three-month study in which:
- Patients' risk factors for taking the test were recorded (i.e., as additional information on the order requisition form)
- If a blood specimen's result was positive despite the patient having risk factors indicating a low probability of a positive result, the laboratory took an aliquot of that specimen for a second round of testing. This followed the manufacturer's recommendation and was expected to reduce blood redraws.
Over 1,000 specimens were received during those three months. We learned the following:
- The test manufacturer's recommended retesting yielded the same result across all repeated tests, so we concluded that it is not a practical solution. The public health group agreed.
- For patients whose blood was redrawn, conversions from a positive first specimen to a negative second specimen (or vice versa) occurred in specimens with results near the cut-off. This highlights a need for better guidelines (for medical practitioners) on how to interpret results near this cut-off point.
- In the three months of the study, the data indicated that less than 1% of the specimens received by the laboratory came from the same patient and had result conversions. This illustrates that, although annoying and a potential source of delays, blood redraws affected fewer people than initially projected. However, medical submitters and public health groups need to review the decision rules surrounding blood redraws (e.g., whether risk factors are included) to prevent unnecessary callbacks to patients.
Using the workflow above, I summarised the data and had it ready for review within about 2 hours of the results being completed. I used R in this case because I could perform more complex queries with it than with Power BI.
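As a sketch of what one of those queries might look like, this is how result conversions could be flagged in R; the data frame and column names (results, patient_id, collection_date, result) are hypothetical, not the study's actual code:

```r
library(dplyr)

# Flag patients whose consecutive specimens switched between
# positive and negative (column names are hypothetical)
conversions <- results %>%
  arrange(patient_id, collection_date) %>%
  group_by(patient_id) %>%
  filter(n() > 1) %>%                          # repeat draws only
  mutate(converted = result != lag(result)) %>%
  summarise(any_conversion = any(converted, na.rm = TRUE)) %>%
  filter(any_conversion)
```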
In another instance, I used R to analyse the results of about 2,000 water tests to see whether the results were affected by the in-house sterilisation of the pipette tips used during the tests. I determined that binary logistic regression was the best tool because of the type of data generated. (I don't know whether MS Excel or Power BI can handle this type of analysis.) The results indicated no significant difference between tests that used pipette tips sterilised in-house and those that did not. As a result, we didn't have to request resampling from the various locations (e.g., beaches, lagoons, groundwater, runoff, faucets) for retesting.
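In base R, this kind of model is a one-liner with glm(). A minimal sketch, assuming a hypothetical water_tests data frame with a binary outcome and a flag for in-house sterilisation:

```r
# Binary logistic regression: does in-house tip sterilisation
# predict the test outcome? (names are hypothetical)
model <- glm(result_positive ~ tips_sterilised_in_house,
             data = water_tests,
             family = binomial)
summary(model)  # a non-significant coefficient supports "no difference"
```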
Generating (semi)-automated reports
One of my favourite reasons for using business intelligence tools is that they make report and dashboard preparation easier. I have built numerous dashboards and report tables in Power BI that can be refreshed whenever the latest data is extracted from the LIMS. All staff have to do is hit "Refresh", and the report is ready for review!
The laboratory must submit weekly updates on test results for seasonal respiratory viral illnesses to the state laboratory (especially during flu season). It also submits information about specimens that test positive for carbapenem resistance to the state laboratory at least once a month. Internally, the laboratory uses Power BI to generate tables for each patient with Mycobacterium sp.-positive specimens. These tables are critical in determining which new specimen from a patient gets tested for organism identification (at least 30 days after the collection date of the last positive specimen) and drug susceptibility (at least 90 days after the collection date of the previous specimen sent to the state laboratory).
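A hedged sketch of how the 30-day rule could be expressed in R follows; the tb_positives data frame and its columns are hypothetical, and the 90-day susceptibility rule would follow the same pattern against the date of the previous specimen sent to the state laboratory:

```r
library(dplyr)

# For each patient, flag specimens collected at least 30 days after
# the previous positive, making them eligible for identification
tb_positives %>%
  group_by(patient_id) %>%
  arrange(collection_date, .by_group = TRUE) %>%
  mutate(
    days_since_prev = as.numeric(collection_date - lag(collection_date)),
    eligible_for_id = is.na(days_since_prev) | days_since_prev >= 30
  ) %>%
  ungroup()
```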
The process is semi-automated because most tasks are completed with a single click on the "Refresh" button, with the manual components being data extraction, exporting, and data cleaning.
Challenges of using business intelligence tools
Though they offer many advantages, business intelligence tools have not been widely adopted at the NSYM PHL. Power BI and R are available upon request. Once either (or both) of the software packages is installed on their computers, staff face a learning curve whose steepness depends on their previous experience with databases and coding, as well as their willingness to learn something new.