About this event
Presented by
Craig Huff
Sr Technical Manager
Waters ERA
Key Learning Topics
- Tools available for monitoring PT performance
- Application of regression equations and fixed limits for data evaluation
- Robust statistical techniques
Who should watch
- Analytical scientists
- Laboratory managers
- Quality assurance managers
Christy Abbas (00:04):
Hello, everyone, and welcome to our webinar today entitled Understanding PT Statistical Analysis and Evaluation. Your speaker today is Craig Huff. I am Christy Abbas, and I will be your moderator. Before we get started with the webinar today, I'd like to go over a few housekeeping details. What you see there on the slide is the console that you're looking at. In the center are the slides that Craig will go through. There on the right-hand side is Craig's bio. Below that, we have a survey and we'd love for you to take a few seconds after the webinar to fill that out for us. There on the left-hand side, we've got the Q&A chat box.
(00:53):
So, if you have a question during the webinar that you'd like to have us answer during live Q&A, please put that question in that chat box, and then below the chat box, we've got a resource listing there for you and that includes a link to the eDATA page, our PT study schedule, our website, and then you can access Craig's slides there as well. Our webinar today is Understanding PT Statistical Analysis and Evaluation. This webinar will provide a comprehensive overview of statistical analysis and evaluation methods used in proficiency testing. The primary focus is on the models and techniques employed by the TNI or The NELAC Institute, which is widely recognized in the United States.
(01:49):
Key topics will include the use of Z-scoring models for trend analysis, PT study consensus approaches, and the application of regression equations and fixed limits for data evaluation. We will also cover the importance of robust statistical techniques. Additionally, we will discuss the Grubbs' test for outlier detection and the criteria for acceptance limits. Craig will conclude with a discussion on the challenges of multimodal data distributions and the tools available for monitoring PT performance, including Z-Scores and custom export reports. Your speaker today is Craig Huff. Craig is a senior manager with 35 years of experience in the environmental testing industry.
(02:40):
Over the course of his career in the environmental field, Craig has served in many roles in the following areas: analytical chemistry, customer service, marketing support, production planning, and senior operations management. He has extensive experience in analytical chemistry, statistical analysis, product development, and regulatory compliance. Craig has 29 years of experience working for Waters ERA and played a lead role in the design, development, and evaluation of the PT programs and CRMs that Waters ERA now offers. He has conducted technical and product training seminars designed to support government, commercial, and municipal environmental laboratories.
(03:29):
He currently serves on multiple NELAC committees in support of proficiency testing and laboratory accreditation. Craig has a BA in geology and an MBA. He's currently the senior technical manager at Waters ERA in Golden, Colorado. You'll be in very good hands today, and so with that, I turn it over to Craig.
Craig Huff (03:58):
Thank you, Christy, for that introduction, and welcome everybody to today's webinar. The topic today will be, as Christy mentioned, PT statistics approaches, and I want to go over some key points, some key techniques that are used by PT providers. Hopefully, when you come out of this, you'll have a better idea of how to look at your PT data before you enter results and before you submit for your evaluations. This should be a valuable tool to help you assess your performance in PT studies as well as your day-to-day analytical approaches. So, a quick outline for today is we're going to look at consensus approaches to PT studies. We're going to go over some applications of regression equation-based limits and fixed limits for evaluation criteria.
(04:51):
I'm going to also cover assigned values, how PT providers assign values for various analytes. I'll touch base on some robust statistical techniques that are utilized. I'll also be talking about an important topic, PTRLs. A lot of people tend to overlook this, but I'll re-emphasize the importance of knowing what yours are and how to apply them when you're entering PT data. I'll also be talking about some tools available for monitoring your PT performance. These will be tools exclusive to ERA. After which I will discuss multimodality in datasets and how we address those in terms of PT evaluations, and I'll finish up with a definition of Z-Scores and why they're a useful tool to trend your PT results. Okay.
(05:48):
So, let's start off here with some commonly utilized models in environmental testing today in the US. From a statistical analysis evaluation perspective, there's really three primary approaches. There's the NELAC or TNI approach, Z-Scores, along with study consensus approaches. For the purpose of today's webinar, we're going to focus primarily on the TNI approach as that's what dictates a lot of the evaluation criteria as well as the PT requirements for environmental laboratories here in the US. Okay, so let's start with what happens when a PT study officially closes? Well, on that date and that time, the study gets locked down and then we apply some general statistical models to evaluate the data.
(06:38):
I should mention that depending on the analyte and the number of data points, the model can change. So, I want to clear up any confusion that might exist on the difference between robust statistics and simple arithmetic statistics. So, for robust, which is the primary technique that we use today, it's utilized for sample sizes of 20 or more data points for a given analyte. What this is, it's a multi-iterative bi-weighted mean and standard deviation that we calculate from the data set. So, what is bi-weighted and how is it calculated? Well, quite simply, it starts by calculating the median of the data set. That's the median, not the mean.
(07:22):
It assigns a weighting factor to each data point depending on how far that point is from the median. We do 15 iterations of this to arrive at the robust mean, and then we do the same thing for the standard deviations. So, you ask, "Why would you use a robust technique?" Well, quite simply, it minimizes the effect that data outliers may have on a mean or a standard deviation within a given population. The other option that we have available is arithmetic. Now this is just a simple average and simple standard deviations. These are used for sample sizes between 7 and 19 data points typically. Again, as I mentioned, the robust technique is more reliable and it paints a more accurate picture of the actual data set.
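Below is a minimal Python sketch of the kind of iterative biweight calculation described above. The Tukey-style weighting, the tuning constants, and the scale formula are common textbook choices used here as assumptions for illustration, not ERA's exact implementation.

```python
import numpy as np

def biweight_mean_sd(values, c=6.0, iterations=15, tiny=1e-12):
    """Iterative biweight estimate of the mean and standard deviation.

    Illustrative only: starts from the median, re-weights points by their
    distance from the current center, and iterates 15 times as described
    in the webinar. Tuning constants are generic textbook values.
    """
    x = np.asarray(values, dtype=float)
    center = np.median(x)                              # start from the median, not the mean
    for _ in range(iterations):
        mad = np.median(np.abs(x - center)) + tiny     # robust scale for the weights
        u = (x - center) / (c * mad)                   # scaled distance from current center
        w = np.where(np.abs(u) < 1, (1 - u**2) ** 2, 0.0)  # biweight: far-out points get weight 0
        if w.sum() == 0:
            break
        center = np.sum(w * x) / np.sum(w)             # re-weighted mean becomes the new center
    # biweight scale estimate around the final center
    mad = np.median(np.abs(x - center)) + tiny
    u = (x - center) / (9.0 * mad)
    m = np.abs(u) < 1
    n = x.size
    num = np.sqrt(np.sum(((x - center)[m] ** 2) * (1 - u[m] ** 2) ** 4))
    den = np.abs(np.sum((1 - u[m] ** 2) * (1 - 5 * u[m] ** 2)))
    return center, np.sqrt(n) * num / den

# A single wild value barely moves the robust mean, which is the point of the technique.
data = [0.98, 1.01, 1.00, 0.99, 1.02, 1.00, 0.97, 1.03, 1.01, 0.99,
        1.00, 1.02, 0.98, 1.01, 0.99, 1.00, 1.02, 0.97, 1.01, 5.00]
print(biweight_mean_sd(data))
```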
(08:18):
I finished up the last slide talking a little bit about arithmetic means and arithmetic standard deviations. A key point to note here is when you apply that tool, you have to also evaluate for outliers and you have to determine which outliers are present and how you're going to treat them. Here at ERA, and for most PT providers, the Grubbs' test is utilized. It's a common, simple test. I won't get into the actual definition and the formulation of it very much, but I will say that it is only used with arithmetic techniques because, as I stated, robust techniques weight outliers out of the scenario. So, we use the Grubbs' test for sample sizes of 7 to 19 samples, but we also apply a rule that no more than 20% of the values in that data set can be classified as outliers.
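Here is a small sketch of a generic two-sided Grubbs' test combined with the 20% cap mentioned above; the significance level and the one-at-a-time removal loop are illustrative assumptions, not necessarily the exact procedure ERA applies.

```python
import numpy as np
from scipy import stats

def grubbs_screen(values, alpha=0.05, max_outlier_fraction=0.20):
    """Repeatedly apply a two-sided Grubbs' test, flagging at most 20% of points."""
    x = list(map(float, values))
    max_removals = int(max_outlier_fraction * len(x))
    outliers = []
    while len(outliers) < max_removals and len(x) >= 3:
        arr = np.asarray(x)
        n = arr.size
        mean, sd = arr.mean(), arr.std(ddof=1)
        if sd == 0:
            break
        idx = int(np.argmax(np.abs(arr - mean)))       # most extreme point
        G = abs(arr[idx] - mean) / sd                  # Grubbs' statistic
        t = stats.t.ppf(1 - alpha / (2 * n), n - 2)    # critical t value
        G_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t**2 / (n - 2 + t**2))
        if G <= G_crit:
            break                                      # no significant outlier remains
        outliers.append(x.pop(idx))
    return x, outliers

# Typical arithmetic-statistics case: 7 to 19 results for one analyte.
kept, flagged = grubbs_screen([0.98, 1.01, 1.00, 0.99, 1.02, 1.00, 0.97, 1.50])
print(kept, flagged)
```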
(09:19):
So, now that we have an understanding of the statistical models used in proficiency testing, I want to move next to the TNI, or The NELAC Institute, and this is an excerpt of The NELAC Institute Fields of Proficiency Testing (FoPT) table for non-potable waters. I'm going to use this as an example on the next slide to show you how we calculate out both fixed-limit evaluations and regression-based evaluations. So, if you look at this table, you see these A, B, C, and D factors. These are the regression equation factors that we have to apply to each analyte in each study: the A and B factors are used to calculate a predicted mean, and the C and D factors are used to calculate the predicted standard deviation.
(10:07):
Now, in my example, I'm going to use nitrite as N, highlighted in green, and you'll notice these regression equations across here as I apply them to the assigned value and how the evaluation criteria get determined. Before I move on though, here's an example of a fixed-limit acceptance criterion, and this is for orthophosphate as P. Quite simply, you take the assigned value, you multiply it by 0.85 and 1.15, and those are your acceptance criteria that are applied to all the data points in the data set. So, let's take a look at the example that I talked about in the previous slide. We're going to use nitrite as nitrogen and we're going to take you through how the acceptance limits are calculated, how the predicted mean and standard deviations get calculated.
(10:57):
So, as you recall, these A, B, C, and D factors here, those are taken from the TNI FoPT table. So, what we do first is we assume that the PT sample assigned value in this case is one milligram per liter. So, based on weights and measures, the PT provider made nitrite as N to 1.00 milligram per liter. So, the first thing we do is calculate the predicted mean. So, we take that one milligram per liter times the A factor of 1.0017 plus the residual of -0.0030. That gives us a predicted mean for the study of 0.999 milligram per liter for nitrite as N. The next thing we do, we do the same thing for the standard deviation.
(11:48):
So, we take the made-to value or assigned value of one milligram per liter times the C factor 0.0377 plus the residual 0.0250, and that gives us a predicted standard deviation of 0.0627 milligrams per liter. Now, keep in mind for NPW, or non-potable water, commonly referred to as wastewater, we apply three standard deviations, this standard deviation times three, around the predicted mean. So, that gives us three times 0.0627 here, and that yields an acceptance range of 0.811 to 1.19 milligrams per liter. Now, for drinking water, we multiply that by two. For the solid chemical materials, or soils, we also use three predicted standard deviations to calculate out the acceptance criteria.
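A worked version of the calculation above, in Python, using the nitrite as N coefficients quoted in the webinar (A = 1.0017, B = -0.0030, C = 0.0377, D = 0.0250). Coefficients change whenever TNI revises the FoPT tables, so treat these numbers as illustrative; the helper names are mine, not ERA's.

```python
def regression_limits(assigned, a, b, c, d, k=3):
    """Predicted mean, predicted SD, and k-sigma acceptance range."""
    predicted_mean = a * assigned + b        # mean = A * assigned value + B
    predicted_sd = c * assigned + d          # SD   = C * assigned value + D
    return predicted_mean, predicted_sd, (predicted_mean - k * predicted_sd,
                                          predicted_mean + k * predicted_sd)

def fixed_limits(assigned, low=0.85, high=1.15):
    """Fixed-limit criteria, e.g. the +/-15% used for orthophosphate as P."""
    return assigned * low, assigned * high

# Nitrite as N at an assigned value of 1.00 mg/L, non-potable water (k = 3):
mean, sd, (lo, hi) = regression_limits(1.00, a=1.0017, b=-0.0030, c=0.0377, d=0.0250, k=3)
print(round(mean, 3), round(sd, 4), round(lo, 3), round(hi, 3))
# -> 0.999 0.0627 0.811 1.187   (drinking water would use k = 2)

print(fixed_limits(1.00))       # -> (0.85, 1.15)
```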
(12:46):
The benefit of these regression equations is that they take into account the analytical biases in the data that was used to generate them. A lot of PT provider data went into the statistics behind these regression equations. So, in theory, they take into account the most common analytical biases associated with the Environmental Protection Agency methods used today. Let's now talk about PT sample concentrations, proficiency testing reporting limits, or PTRLs, and their impacts on acceptance limits and overall evaluation criteria. So, when we apply regression-based acceptance limits, you'll note that they'll typically change as a percentage of the assigned value over the PT concentration range.
(13:35):
As a general rule, the acceptance criteria will widen as the concentration of an analyte approaches the low end of the prescribed range, whereas fixed limits yield the same relative percentage across the entire concentration range. So, you're not seeing any real bias in that situation. Now, when we talk about proficiency testing reporting limits, PTRLs, as defined in volume three of the 2016 NELAC standard, NELAC defines it as a statistically derived value that represents the lowest acceptable concentration for an analyte in a proficiency test sample if the analyte is spiked into that proficiency testing sample. The PTRLs are also specified on the TNI FoPT tables.
(14:28):
One thing to note is PTRLs are not the same as method reporting limits, limits of detection, or method detection limits. They're distinctly different and apply only to proficiency testing concentration ranges. Now, a key consideration to be aware of for PTRLs is that while you are not technically required to be able to quantitate down to the PTRL for any given analyte, your analytical method should be able to quantitate down to these levels to give you some added assurance that you can properly report a result should the PT provider have an assigned value at or very close to the lower end of the concentration range.
(15:11):
Noting that acceptance limits can extend below the lower concentrations in these situations, it's in your best interest to be able to see below that lowest concentration range and down to that PTRL. Moving on to the topic of assigned values and how they are determined by the PT providers, there are really three basic techniques used to calculate an assigned value for PT samples. The first is an actual made-to value as determined by the PT provider using their mass and volumetric measurements and also taking into account any chemical impurities or the substrate purities, so to speak. The other way to do it is to use measured means that are established by the PT provider through their own internal analyses.
(16:04):
This applies a lot more to SCM FoPT table analytes because those are based on methods that don't typically recover the analyte quantitatively in a given sample. The last one is we use the PT study mean. We use this when it's specified in the FoPT table. When we do this, we only use the C and D factors, that is, the factors used to calculate the predicted standard deviation. Again, these reside on the FoPT tables, and PT providers must utilize these rules and adhere to them to calculate out their assigned values. The other thing you need to be aware of from the PT provider side is that we have to be compliant with verification, homogeneity, and stability testing criteria around the assigned values.
(16:58):
As a general rule, PT providers need to be able to quantitate, from an accuracy perspective, to one-third the criteria that the PT participants are required to be evaluated against, and to one-sixth for stability criteria. So, it's a lot tighter. For that reason, PT providers often have modified methods based on the EPA methodologies that are designed to accurately quantify and obtain good precision data for samples in PT studies. Now, the other thing that I don't mention here but is worth mentioning is that for a lot of the organic analytes, for example volatiles, base neutrals, and acids, where the PT provider doesn't spike every analyte into each sample for a given study, PT providers have to be able to quantitate accurately down to one-half the PTRL.
(17:58):
This helps ensure that any false positives don't get taken into account when evaluating laboratory data. Basically, you don't want to have a false positive impact the evaluation criteria for any one laboratory or multiple labs. One statistical issue that I haven't mentioned before but is very important to understand is multimodal data. What is it and how do PT providers handle it? Well, multimodal distributions can occur when you have two or more data distribution scenarios exhibited within a given data set. Each PT provider has to have a method approved by their PTPA for detecting and treating these situations. A PTPA is a proficiency testing provider accreditor.
(18:54):
In ERA's case, that's A2LA, and they have approved how we address multimodal data. So, when a PT provider does detect multimodality in a data set, they must assess the cause, segregate the data, and evaluate it separately. Or, if they can't determine the cause, they have the option to invalidate the entire sample for that PT study. It's not something that we like to see happen, but it can happen, particularly in small data sets. Some potential causes of multimodality that we look for are preparatory and analytical method biases; that is, two or more methods may not yield equivalent performance characteristics. In that case, you can get a multimodal distribution. We also look at PT sample inhomogeneity within the sample and between the samples.
(19:50):
This is important, and this is why PT providers have to ensure that their samples are homogeneous for the intended use, i.e., in this case, proficiency testing studies. Now, when you're looking at aqueous samples, which for the most part by design are homogeneous, it really doesn't come into play much. However, when you're looking at soil samples, particularly when you can sub-sample out of a soil PT sample, that sample has to be homogeneous both within each bottle as well as across the entire batch of bottles. So, that's something else PT providers have to look at and evaluate. Another cause of multimodality can be attributed to instability in the sample during the course of the study.
(20:42):
So, if you have an analyte that's starting to degrade before the study closes, that can result in a multimodal data distribution. ERA has tools to look at that, based on what dates were reported for the analytical results and when they were entered. We can take a look at that. It rarely happens, but occasionally it can, and when it does, it manifests itself in a fairly clear multimodal pattern. Okay, let's take a look at some tools that we have available for you, tools that you can use for PT monitoring and for trending your PT results that are currently available in our eDATA system. Two reports that we currently have are the PT performance report and the PT exceptions report. The PT performance report is a report that pulls out all the data that you received an evaluation on.
(21:42):
The exceptions report, on the contrary, presents you with the analytes that you did not get an acceptable evaluation on. These are two pretty good reports, and I will note that they both present Z-Scores for you as well. So, you can take a look at your Z-Scores and plot and trend your PT data over time. We also have a custom export generator available to you, where you can define and save the data that you want when you want it, and you can do that for each study, or you can look at multiple studies when you're generating data out of that export generator. Now, I mentioned Z-Scores. Z-Scores are a pretty powerful trending tool. They allow you to know when you have an opportunity to improve before you experience a not acceptable evaluation on a PT result.
(22:33):
All of these tools are available to you today, and I would encourage you to give them a try. I mentioned Z-Scores in the previous slide and what a powerful tool they can be. First question is, what is a Z-Score? Well, quite simply, it represents the distance of your result from the mean of the data set, expressed in standard deviations of the study data. The formula is pretty simple: the Z-Score equals your result minus the mean from the study data, divided by the standard deviation from the study data. So, if you have a negative Z-Score, this represents a result that fell below the study mean. If you get a positive Z-Score, it represents a result that's above the study mean.
(23:26):
For evaluation purposes, a Z-Score of typically less than or equal to two, and sometimes less than or equal to three, in absolute value is applied depending on the study type. So, for example, for drinking water, a Z-Score of two or less is an acceptable result. For wastewater or NPW PT studies, we apply three standard deviations, so if you get a Z-Score less than three, you get an acceptable result for a wastewater or non-potable water study. So, we've covered a lot of information today. There's a lot to chew on. If you have any questions, feel free to reach out to me directly, or you can contact me through the ERA website, eraqc.com. Here's some other information and websites that you can also look at.
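A minimal sketch of the Z-Score formula and the acceptance rule just described; the study mean and standard deviation below reuse the nitrite as N numbers from the earlier example, and the reported result is hypothetical.

```python
def z_score(result, study_mean, study_sd):
    """Z-Score: distance of your result from the study mean, in study standard deviations."""
    return (result - study_mean) / study_sd

def is_acceptable(z, study_type):
    """|Z| <= 2 for drinking water, |Z| <= 3 for NPW/soil, per the rule described above."""
    limit = 2 if study_type == "drinking_water" else 3
    return abs(z) <= limit

z = z_score(result=1.05, study_mean=0.999, study_sd=0.0627)
print(round(z, 2), is_acceptable(z, "non_potable_water"))   # -> 0.81 True
```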
(24:21):
I would strongly encourage you to take a look at The NELAC Institute, or TNI, website and take a look at the FoPT tables for each of the different types of studies. There's a lot of information there also about laboratory accreditation and additional PT program information. Again, you can get that information from nelac-institute.org. There's also a lot of criteria built into the ISO 17025, 17043, and 17034 standards; ERA maintains accreditation to all three of these standards as well as to the 2016 TNI standard. So, that concludes the webinar for today. I thank you. If you have any questions, please feel free to ask. Thank you again. Bye.
Question 1:
How do you determine acceptance limits for a given method?
Answer 1:
As I stated in the webinar earlier, the acceptance limits are going to be derived primarily from the NELAC FoPT table. PT providers are required to follow those criteria, and any deviation from them has to be approved by their proficiency testing provider accreditor.
Question 2:
When a provider does not send Z-Score information, what is the best way to derive it?
Answer 2:
If you don't get a Z-Score from your PT provider, you can calculate it based on the formula provided in the webinar. Alternatively, you can use a predicted standard deviation calculated from an FoPT table or historical standard deviation and mean information from the PT provider. ERA's approach is to use the actual study mean and the actual study standard deviation to calculate Z-Scores.
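As a small illustration of the alternative described in this answer, the hypothetical regression_limits() helper from the earlier sketch can supply a predicted mean and standard deviation from FoPT coefficients when the actual study statistics aren't reported; the reported result below is made up.

```python
# Z-Score from FoPT-predicted statistics (illustrative; nitrite as N coefficients
# quoted in the webinar, hypothetical reported result of 1.05 mg/L).
predicted_mean, predicted_sd, _ = regression_limits(1.00, a=1.0017, b=-0.0030,
                                                    c=0.0377, d=0.0250)
z = (1.05 - predicted_mean) / predicted_sd
print(round(z, 2))   # -> 0.82 standard deviations above the predicted mean
```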
Question 3:
Why does Z-Score criteria vary per provider and/or method?
Answer 3:
Z-Scores are derived from study means and study standard deviations. When those change, Z-Scores also change. Additionally, the size and tightness of the data sets impact Z-Scores. Some PT providers might use predicted standard deviations and means when data is sparse for a given analyte, whereas ERA prefers to use actual consensus data for each analyte in each study.
Question 4:
Are the coefficients obtained by adjusting all the data, including customer data? Is it a nonlinear adjustment?
Answer 4:
The adjustment is typically linear.
Question 5:
Why are the regression equations simple linear equations with only two terms for mean and standard deviation, and not more complex regression equations?
Answer 5:
The model chosen by TNI for the FoPT table uses simple linear equations for both the standard deviation and the mean. This statistical approach was prescribed by TNI and is based on data submitted by PT providers to develop these regression equations or fixed limits.
Question 6:
Is it the same calculation for Z-Scores for microbiology?
Answer 6:
Microbiology is slightly different. You need to take the log of the data and use the log-transformed mean and standard deviation to calculate Z-Scores for microbiology. In other words, a log-normal transformation is applied.
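A brief sketch of the log treatment described in this answer; the base-10 log and the CFU values are assumptions for illustration, not ERA's published procedure.

```python
import numpy as np

def micro_z_score(result_cfu, study_results_cfu):
    """Z-Score for microbiology: log-transform the data first, then score the log of
    your result against the mean and SD of the log-transformed study results."""
    logs = np.log10(np.asarray(study_results_cfu, dtype=float))
    return (np.log10(result_cfu) - logs.mean()) / logs.std(ddof=1)

study = [180, 210, 195, 240, 160, 205, 230, 175, 220, 190]   # hypothetical CFU/100 mL
print(round(micro_z_score(250, study), 2))
```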