限值设定时的异常值处理与数据分组考量
Handling Outliers and Data Grouping When Setting Limits
1. 引言 Introduction
此前,我们已经探讨过如何使用统计学方法建立环境监测(Environmental Monitoring, EM)的警戒限与行动限。而在实际执行过程中,两个经常被提出的问题是:如何进行数据分组,以及如何处理异常值。本文将对这两个主题进行进一步探讨。需要注意的是,关于异常值处理与数据分组的原则不仅适用于EM限值的设定,也同样适用于中间过程控制样品(In-Process Control, IPC)、原辅料等微生物限度的建立。
In previous discussions, we explored statistical approaches for establishing alert and action limits in environmental monitoring (EM). In practical implementation, however, two frequently encountered questions are how to appropriately group data and how to handle outliers. This post further examines these two topics in detail. It is worth noting that the principles discussed here are not limited to EM, but also apply to in-process control (IPC) samples, raw materials, excipients, etc., where microbiological limits are established.
2. 异常值的处理 Handling Outliers
2.1 业界争议 A Topic of Debate
对于在限值计算时是否需要剔除异常值,目前仍然存在一定的争议。Wilson于1997年发表于PDA Journal上的评论提到 [1],
When setting alert and action limits, whether outliers should be excluded remains a topic of debate. Wilson (1997) commented in the PDA Journal that [1]:
"Regardless of the method employed, however, a special concern is how one deals with outliers and clusters of unusually high counts, where the process was out of control, will lead to inappropriately high alert/action limits."
Wilson, 1997
"Such clusters of obviously out of control data should be excluded before alert/action limits are calculated."
Wilson, 1997
类似的,Roesti写到 [2],
Similarly, Roesti wrote [2],
"Outliers may be excluded from the calculations if there is a suitable justification."
Roesti, 2019
然而,问题的关键在于该如何判断"unusually high counts"?在缺乏清晰标准的情况下,这往往成为一个“先有鸡还是先有蛋”的难题。更重要的是,在当前数据完整性(Data Integrity, DI)监管日益严格的环境下,任何数据剔除行为都必须基于充分的科学证据并提供文件化理由,否则可能被视为数据操纵(data manipulation)。因此,对异常值的识别与处理应格外谨慎。
However, the key question is how to determine what qualifies as “unusually high counts”. What criteria should be used, and how can such an exclusion be properly justified? This is not straightforward and can often lead to a “chicken-and-egg” dilemma. Moreover, under the increasing scrutiny of Data Integrity (DI) requirements, any exclusion of data must be approached with extreme caution: without adequate scientific justification and documentation, it could be regarded as data manipulation.
2.2 异常值测试的适用性 Applicability of Outlier Testing
在部分实践中,有人建议通过统计学方法识别异常值后再进行剔除。然而,需要注意的是,常见的Grubbs检验与Dixon Q检验均假设数据服从正态分布(normal distribution),而微生物监测数据往往呈非正态分布,多为右偏的离散型分布。因此,这些方法在EM或IPC数据中并不适用。
Some professionals propose using statistical tests to identify outliers before exclusion. It should be noted, however, that several commonly used methods, such as Grubbs’ test and Dixon’s Q test, assume that the data follow a normal distribution. This assumption is often invalid for EM and IPC data, which are frequently right-skewed and discrete; therefore, these methods are not suitable for most cases.
非参数方法如四分位距法(Interquartile Range, IQR)定义异常值为:
The Interquartile Range (IQR) method defines outliers as those:
x < Q1-1.5*IQR or
x > Q3+1.5*IQR
其中 where IQR = Q3 - Q1
该方法虽不依赖分布假设,但其本质上仍属于基于分位数的算法。若后续限度计算本身也基于分位数(如95%、99%分位法),那么此时的异常值识别与限度计算之间会形成循环论证(circular reasoning),降低统计解释力。此外,当样本量较小时,被识别为异常值的数据点可能仅仅是取样误差或自然波动的结果,并不是真正的异常。
Although the IQR method is a non-parametric alternative that does not rely on distributional assumptions, it is still a quantile-based algorithm. If the limits are subsequently calculated with a percentile-based method (e.g., the 95th and 99th percentiles), outlier identification and limit calculation both rest on quantiles of the same data, which makes the reasoning circular and weakens the statistical interpretation. Furthermore, in small datasets, a statistically identified "outlier" may simply result from sampling variation or natural fluctuation rather than a true abnormality.
“Utilization of statistical tools to remove outliers or manual exclusion to smoothen the pattern is not recommended since the remaining data points would not represent the overall variability of data and calculated levels might be underestimated.”
Roesti, 2019
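The IQR rule above can be illustrated with a minimal Python sketch. The function name `iqr_outliers` and the CFU counts are hypothetical and chosen only for illustration; the quartiles are computed with the standard library's `statistics.quantiles` using the "inclusive" method, which is one of several quartile conventions.

```python
from statistics import quantiles

def iqr_outliers(counts):
    """Return (lower_fence, upper_fence, flagged) using the 1.5 * IQR rule."""
    # The 'inclusive' method interpolates between the observed minimum and maximum.
    q1, _, q3 = quantiles(counts, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    flagged = [x for x in counts if x < lower or x > upper]
    return lower, upper, flagged

# Hypothetical CFU counts: mostly zeros with a few low recoveries (right-skewed, discrete).
counts = [0] * 16 + [1, 1, 2, 5]
lower, upper, flagged = iqr_outliers(counts)
print(lower, upper, flagged)
```

In this made-up dataset both quartiles are 0, so the IQR is 0 and every non-zero count is flagged, including counts of 1 CFU that are likely ordinary variation. This is consistent with the caution above about applying such rules mechanically to EM or IPC data.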
2.3 一种推荐做法 A Recommended Approach
私以为,一种合理且合规的做法是剔除高于规定限(法规最大限)的数据(如有,事实上很少存在这样的数据),并用剩下的数据进行限度的计算和分析。当使用合适的统计分析工具和合理的分位数时,异常偏高的结果对计算结果造成的影响是可以规避的。
In the author’s opinion, a reasonable and scientifically defensible approach is to exclude only those data that exceed the regulatory maximum limits (such data are rare in practice) and to calculate the limits from the remaining data. When suitable statistical tools and appropriate percentiles are used, the influence of unusually high results on the calculated limits can be effectively minimized.
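As a rough sketch of this approach, the Python below drops only results above a regulatory maximum and then takes the 95th and 99th percentiles of the remaining data as alert and action levels. The function names, the counts, and the value of `regulatory_max` are hypothetical, and the nearest-rank percentile definition is an assumption made here for simplicity; other percentile definitions would give slightly different values.

```python
def nearest_rank_percentile(sorted_values, pct):
    """Smallest value with at least pct percent of the data at or below it."""
    n = len(sorted_values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * n + 99) // 100
    return sorted_values[max(rank, 1) - 1]

def alert_action_levels(counts, regulatory_max):
    """Exclude results above the regulatory maximum, then take percentiles."""
    kept = sorted(x for x in counts if x <= regulatory_max)
    excluded = [x for x in counts if x > regulatory_max]
    return {
        "alert": nearest_rank_percentile(kept, 95),
        "action": nearest_rank_percentile(kept, 99),
        "excluded": excluded,
    }

# Hypothetical CFU counts; 250 exceeds the assumed regulatory maximum of 200.
counts = [0] * 20 + [1] * 10 + [2] * 5 + [3] * 2 + [4, 6, 9, 250]
levels = alert_action_levels(counts, regulatory_max=200)
print(levels)
```

Every excluded value is returned alongside the levels so that the exclusion can be documented rather than silently applied.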
此外,其他可以考虑剔除的,明显不具有代表性的数据包括(均需要提供相关的说明和解释)[2]:
Other clearly unrepresentative data points that may be considered for exclusion, each with a documented rationale, include [2]:
(1)已被确认存在偏差且完成根因分析与CAPA的数据。
Deviation counts that resulted in a clear root cause assessment and corrective and preventive actions.
(2) 工艺、设备或设施重大变更前的数据。
Older counts before a major change that affected the bioburden (e.g., upgrade and downgrade of a cleanroom).
(3) 原料供应商或质量等级变更前的数据。
Former excipient microbial enumeration counts following a change in supplier/quality grade.
3. 数据分组合并 Data Grouping
3.1 为什么需要合并 Why do we need to combine data?
在EM,IPC等检测中,单个采样点(或者某类样品)在给定周期内(如,一年)的样本量通常有限。而统计评估限值(如分位数法、公差区间法或分布拟合)通常需要一定的数据量以确保可执行性和稳健性(或置信度)。因此,基于活动或风险特征的对数据进行合理分组以保证数据量常常是确保统计有效性的必要手段。
In EM, IPC and other monitoring, the sample size from a single sampling point or a specific type of sample is often limited within a defined period (e.g., one year). However, statistical approaches used for establishing alert and action limits (such as the percentile method, tolerance interval approach, or distribution fitting) generally require a sufficient dataset to ensure statistical feasibility, robustness, and confidence. Therefore, grouping data based on activity characteristics or risk profiles to enlarge sample size is often a necessary strategy to achieve statistical validity.
正如 PDA TR13 所述 [4]:
As stated in PDA Technical Report No. 13 [4]:
“Areas that have the same activities may be grouped to provide more data points; for example, material airlocks of the same Grade, or Grade D hallways of the same function can be grouped.”
PDA TR13
“All the test data for a particular site, or group of similar sites, are arranged in a histogram, and the alert levels and action limits are set at values whose monitoring results are respectively 5% and 1% higher than the level selected (i.e., 95th and 99th percentile, respectively). ”
PDA TR13
换言之,当不同采样点之间在操作性质、洁净级别以及人员与物流模式上具有相似性时,为每个采样点单独建立限值意义不大。相反,将其视作同一统计总体进行统一的限度设定,不仅能显著提升样本量,也有助于提高计算结果的稳定性与代表性。
In other words, when sampling locations (sample types) share comparable activities, cleanroom grades, and personnel/material flow patterns, establishing separate limits for each location (sample type) offers little additional value. Instead, treating them as a single statistical population for limit calculation not only increases sample size but also improves the stability and representativeness of the results.
当然,另一种常见的做法是延长统计周期(例如使用两年或更长周期的数据),以弥补样本量不足的问题。然而,这一策略也存在一定局限:一方面,某些检测项目即使延长周期仍难以获得足够的数据量;另一方面,过长的周期可能引入代表性不足的历史数据,尤其在工艺、设施或管理策略已发生变化的情况下。因此,如前文所述,应谨慎评估数据的时间适用性,并根据实际变更情况剔除过时数据,以确保限值反映当前工况的真实水平。
Alternatively, one may consider extending the statistical evaluation period (e.g., using two years of data) to increase the dataset size. However, this approach has inherent limitations. Some test items may still fail to reach the minimum required sample size, while overextending the time window could compromise data relevance, particularly if facility upgrades, process modifications, or procedural changes have occurred. Therefore, as discussed earlier, older data should be carefully assessed and excluded if they no longer reflect the current process state, ensuring that the established limits remain representative and scientifically valid.
3.2 数据合并原则 Rules for Combination
更具体地说,在评估数据集是否可以合并时,可参考以下原则 [2]:
(1)数据是否来自相同的检测项目;
(2)数据是否来自设计或工艺特征相似的区域;
(3)数据是否来自理论上应具有相似微生物负荷的区域。
To be more specific, the following principles can be applied when evaluating whether datasets can be grouped [2]:
(1) The data come from the same test item;
(2) The data originate from areas of comparable design or process, e.g., rooms of the same grade;
(3) The data originate from areas in which a similar microbiological burden is expected.
然而,需要特别指出的是,满足上述条件并不意味着就可以直接合并数据。例如,对于(1),如果两个检测项目均为辅料的生物负荷(bioburden)检测,但单位分别为 CFU/10 mL 和 CFU/30 mL,则不建议将两者合并,除非能够进行科学合理的换算。这种换算绝不能是简单的比例放大或缩小,相关讨论可参考 Yang 等(2013) 的研究 [3]。
However, it is important to note that meeting these conditions does not by itself justify data grouping. For instance, regarding criterion (1), if two test items are both bioburden tests for excipients but are expressed in different units (e.g., CFU/10 mL vs. CFU/30 mL), grouping is not recommended unless a scientifically justified conversion can be performed. This conversion should not be a simple proportional scaling; readers may refer to Yang et al. (2013) for further discussion [3].
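One way to see why simple proportional scaling is problematic is to assume, purely for illustration, that counts follow a Poisson distribution with a constant contamination rate per mL. This assumption and the rate used below are hypothetical and are not taken from the cited paper. Under it, multiplying 10 mL results by 3 reproduces the mean of a 30 mL test but not its variance.

```python
def poisson_mean_var(rate_per_ml, volume_ml):
    """Mean and variance of a Poisson count for the given test volume."""
    lam = rate_per_ml * volume_ml
    return lam, lam  # for a Poisson variable, the variance equals the mean

def scaled_mean_var(rate_per_ml, volume_ml, factor):
    """Mean and variance after multiplying every count by a constant factor."""
    mean, var = poisson_mean_var(rate_per_ml, volume_ml)
    # Scaling a random variable by c multiplies its variance by c squared.
    return factor * mean, factor ** 2 * var

rate = 0.5  # hypothetical contamination rate in CFU per mL
direct = poisson_mean_var(rate, 30)     # sample actually tested on 30 mL
scaled = scaled_mean_var(rate, 10, 3)   # 10 mL result multiplied by 3
print(direct, scaled)
```

Both versions have a mean of 15 CFU, but the scaled 10 mL results have three times the variance of genuine 30 mL results and can only take values that are multiples of 3. Upper percentiles calculated from a mixture of the two would therefore not describe either test correctly.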
对于(2),即使房间属于同一洁净级别,也并不意味着可以直接合并。例如,清洗间通常具有较高的人员活动频率和湿度,因此即便与生产间属于同一等级区域,也可能不适合合并。类似的情况还包括清洁用具间。
As for criterion (2), even when rooms belong to the same cleanroom grade, grouping may still be inappropriate. For example, considering the higher activity level and humidity typically found in washing rooms, it might not be suitable to combine data from washing rooms and production rooms, even if they are both classified as the same grade area. Similar caution applies to the janitor room.
对于(3),举例:同为 D 级的仓库与更衣缓冲间,由于仓库不可避免地会接收来自外部的物料,因此其微生物风险水平与更衣间不同,也不宜合并分析。因此,在判断数据是否可以合并时,除了考虑洁净级别外,还应综合考虑理论上可能的微生物负荷差异。
As for criterion (3), warehouses and changing rooms may both be classified as Grade D, but grouping their data may not be appropriate because warehouses inevitably receive materials from external environments, which gives them a different microbiological risk profile. Therefore, in addition to cleanroom classification, the expected microbiological burden should be taken into account when determining whether datasets can be grouped.
3.3 可行方式 A feasible approach
笔者这里提出一个相对可行的分组决策过程。在无重大工艺或设施变更的前提下,可采用以下逻辑流程(图1):
(1)检测项目是否相同(需考虑单位,检测方法等因素)?
否 → 不可合并;是 → 步骤2
(2)洁净级别是否一致?
否 → 不可合并;是 → 步骤3
(3)进行显著性差异分析(t检验/ANOVA或非参数检验)
存在显著差异(p < 0.05)?是 → 不可合并;否 → 可合并
A practical grouping strategy may proceed as follows, assuming no major process changes, supplier changes, or facility modifications occurred during the evaluation period (Figure 1):
(1) Confirm the same test type (unit, method used should be taken into consideration).
If No, cannot group; If yes, proceed to Step 2.
(2) Confirm the same cleanroom grade.
If No, cannot group; If yes, proceed to Step 3.
(3) Conduct a statistical comparison (e.g., t-test, ANOVA, nonparametric tests) among rooms within the same grade.
If there is a significant difference (usually p < 0.05), do not group. If not, group these datasets.
图1 数据分组流程决策(同洁净区)
Fig 1. Data Grouping Process Flow (for data from the same clean area)
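The three-step flow above can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: the function names and room data are hypothetical, and a two-sided permutation test on the difference in mean counts is used here as one possible nonparametric comparison, chosen because it needs only the standard library.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference in mean counts."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)

def can_group(same_test, same_grade, p_value, alpha=0.05):
    """Step 1: same test item; step 2: same grade; step 3: no significant difference."""
    if not same_test:
        return False
    if not same_grade:
        return False
    return p_value >= alpha

# Hypothetical CFU counts from three rooms of the same grade and test item.
room_a = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
room_b = [0, 1, 0, 0, 0, 0, 0, 1, 0, 0]
room_c = [8, 9, 7, 10, 9, 8, 12, 9, 10, 8]

p_ab = permutation_p_value(room_a, room_b)
p_ac = permutation_p_value(room_a, room_c)
group_ab = can_group(True, True, p_ab)
group_ac = can_group(True, True, p_ac)
print(p_ab, group_ab, p_ac, group_ac)
```

With these made-up data, rooms A and B would be grouped while room C would be kept separate. Where SciPy is available, `scipy.stats.mannwhitneyu` (two groups) or `scipy.stats.kruskal` (several groups) could serve as the comparison in step 3 instead.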
该框架同样可以扩展应用于同一生产基地内的多个独立洁净区(图2),前提是这些洁净区用于生产相似类别的产品,且各区域间未检测到具有统计学显著性的微生物学差异。但笔者不认为可以将不同基地的数据进行合并。
This framework can also be extended across multiple independent clean areas within the same site (Figure 2), provided that these areas are used for manufacturing similar product categories and no statistically significant microbiological differences are observed among them. However, the author does not recommend combining data from different sites.
图2 数据分组流程决策(不同洁净区)
Fig 2. Data Grouping Process Flow (for data from multiple clean areas)
引入基于显著性差异(significance testing)的数据分组具有以下优点:
(1)以客观数据支撑分组决策,避免主观判断;
(2)可运用的统计测验方法很多,样本量要求相对灵活,适用于不同分布;
(3)审计中可提供明确的统计学支撑。
Introducing data grouping based on statistical significance testing offers several key advantages:
(1) It provides objective and data-driven justification for grouping decisions, thereby minimizing subjective judgment;
(2) A wide range of statistical tests can be applied, with flexible sample size requirements and compatibility across different data distributions;
(3) It offers transparent and auditable statistical justification, which can be clearly demonstrated during inspections or audits.
4. 总结 Summary
综上所述,异常值处理与数据分组是限度设定中至关重要的两个环节。在处理异常值时,应避免过度使用统计识别与剔除工具,仅在具有充分证据与文件化理由的情况下进行删除。在数据分组方面,应坚持基于科学依据与统计验证的原则,确保合并后的数据既具代表性,又符合法规期望。
In conclusion, outlier handling and data grouping are two critical components in the establishment of alert and action limits. When dealing with outliers, one should avoid excessive reliance on statistical exclusion methods and only remove data points when there is clear scientific justification and proper documentation to support the action. In terms of data grouping, it is essential to adhere to scientific and statistical principles, ensuring that the grouped datasets remain both representative of the process and aligned with regulatory expectations.
References
[1] Wilson JD. Setting alert/action limits for environmental monitoring programs. PDA J Pharm Sci Technol. 1997 Jul-Aug;51(4):161-2.
[2] Roesti, D. (2019). Calculating Alert Levels and Trending of Microbiological Data. In Pharmaceutical Microbiological Quality Assurance and Control (eds D. Roesti and M. Goverde).
[3] Yang H, Li N, Chang S. A Risk-based Approach to Setting Sterile Filtration Bioburden Limits. PDA J Pharm Sci Technol. 2013 Nov-Dec;67(6):601-9. doi: 10.5731/pdajpst.2013.00942. PMID: 24265301.
[4] PDA TR13, 2022.

