Abstract
Flow cytometry (FC) is a pivotal technique in biomedical research, facilitating the analysis of the physical and biochemical properties of cells. The advent of artificial intelligence (AI) algorithms has marked a significant turning point in the processing and interpretation of cytometric data, enabling more precise and efficient analysis. The application of key AI algorithms, including clustering techniques (unsupervised learning), classification (supervised learning) and advanced deep learning methods, is becoming increasingly prevalent, and multivariate analysis and dimensionality reduction are also commonly applied. The integration of advanced AI algorithms with FC methods contributes to a better understanding and interpretation of biological data, opening up new opportunities in research and clinical diagnostics. However, challenges remain in optimising the algorithms for the specific characteristics of cytometric data and in ensuring their interpretability and reliability.
Citation
Bierzanowski S., Pietruczuk K. Revolution in flow cytometry: using Artificial Intelligence for data processing and interpretation. Eur J Transl Clin Med
Introduction
Flow cytometry (FC) is a technique that enables rapid analysis of large numbers of cells in suspension by measuring the light scattered by the cells and the fluorescence emitted by fluorochromes conjugated to antibodies [1-2]. The two main detectors are the forward scatter channel (FSC), which detects scattering along the laser beam and thus determines the size of the particle, and the side scatter channel (SSC), which measures scattering at 90° and thus assesses the granularity of the cells [3-4]. Other detectors measure the fluorescence produced by excitation of the fluorochrome with a laser beam of the appropriate wavelength [3, 5].
FC is used in research and clinical laboratories to assess cell surface and intracellular antigen expression, enzyme activity, gene expression and mRNA transcription [6]. The method permits the assessment of the cell cycle, mitochondria and cellular processes (e.g. apoptosis, autophagy and cell ageing). It allows the quantification of biological substances in various body fluids, including serum and cerebrospinal fluid. FC allows not only the collection and analysis of data about cells, but also the sorting of cells, based on the principle of deflecting flowing particles according to their electrical charge [6]. The degree of purity obtained is greater than 99%. This method is also employed to isolate rare cell populations, including cancer cells, fetal erythrocytes and genetically modified cells [6-7]. The FC technique can be adapted for the detection, characterisation and enumeration of microorganisms in aqueous matrices, as well as somatic and bacterial cells in milk [8]. In medicine, FC is most widely used in haematology and oncology, specifically in cancer diagnosis, classification and treatment monitoring [7].
Despite the technological developments in FC, data analysis remains a key problem, requiring both standardisation and automation [9]. The aim of this article is to present the potential of AI algorithms in the analysis of cytometric data and the problems that still need to be solved to fully automate the analysis of this type of data.
Manual analysis of cytometric measurements
Manual gating is still the primary method for analysing the results. This step is essential for obtaining relevant information about the cells under study, whether the goal is to study the phenotype of a population or to identify the internal structures of cells [10]. In the case of the analysis of peripheral blood cells, such as lymphocytes, an FSC vs. SSC plot is initially constructed, which facilitates the distinction of the primary cell populations based on their size and granularity. Once the groups of cells of interest have been selected by setting up further gates, the expression of surface markers can be analysed in fluorescence plots. The manual gating process is complicated, time-consuming, subjective and requires advanced knowledge and experience [11-13].
Figure 1. Schematic of manual analysis of cytometric data using peripheral blood mononuclear cells (PBMC) sample as an example
Material and methods
The article is based on a review of the literature available in PubMed (biomedical research publications) and IEEE Xplore (broad access to technical literature in engineering and computer science). The analysis included articles on the application of machine learning algorithms to cytometric data analysis, including methods for mining and interpreting FC data. All included publications were selected for their relevance to the development of ML tools in this field.
Application of AI in cytometric data analysis
The use of artificial intelligence (AI) algorithms to automate the analysis of cytometric measurement data is becoming more common. This approach aims to reduce processing time and improve error resilience compared to manual methods. These algorithms mainly rely on techniques that divide data based on specific criteria [14], including both classification (data are assigned to predefined classes) and clustering (natural groups are found in the data without prior labels). Dimensionality reduction methods (e.g. principal component analysis, t-distributed stochastic neighbour embedding and uniform manifold approximation and projection) are also increasingly used.
Figure 2. Artificial Intelligence (AI) algorithms currently applicable to the analysis of cytometric measurements
Clustering techniques – unsupervised machine learning
Unsupervised learning, unlike supervised learning, operates on unlabeled data, identifying patterns and structures without pre-assigned categories.
k-means
The first clustering algorithm used for analyzing cytometric data was k-means [15-16]. This iterative algorithm identifies data points with similar features around a central point called the ‘centroid’. Points closest to the centroid are grouped together, forming clusters. Distance is crucial in this algorithm and can be defined in various ways, but it is often the smallest sum of distances between the centroids and observations [17]. K-means involves several steps: selecting the number of clusters, initializing centroid positions, assigning each data point to the nearest centroid based on distance, recalculating centroids, and repeating these steps until the centroids’ positions stabilize or a stopping condition is met [15, 17].
Figure 3. Visualisation of the operation of k-means clustering algorithm
A – before the application of k-means; B – effect of the algorithm
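As an illustration of the workflow described above, the following minimal Python sketch applies k-means to a placeholder matrix of events. It assumes the FC data have already been compensated and transformed; scikit-learn is used here only as one convenient implementation, and the marker count and number of clusters are arbitrary choices for the example.

```python
# Minimal k-means sketch on placeholder data standing in for compensated,
# arcsinh- or logicle-transformed FC events (one row per cell, one column per marker).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

events, _ = make_blobs(n_samples=50_000, n_features=8, centers=6, random_state=0)
X = StandardScaler().fit_transform(events)   # put markers on a comparable scale

# The number of clusters must be chosen in advance (a known limitation of k-means).
kmeans = KMeans(n_clusters=6, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)               # cluster label per cell

# Cluster centroids can be inspected as approximate "mean phenotypes".
print(kmeans.cluster_centers_.shape)         # (6, 8)
```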
K-means is a straightforward and effective clustering method, but it faces challenges such as computational scalability, which can limit its use in analysing cytometric data [18]. The algorithm requires substantial computation, with processing time increasing with the number of data points, clusters and iterations, making it inefficient for large datasets such as those from FC. Scalability can be improved by initializing the centroids more efficiently or by using random data samples to update the centroids faster. Additionally, parallel computing is being explored to further enhance scalability [19-20].
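The subsampling strategy mentioned above, i.e. updating centroids from random portions of the data, is available in off-the-shelf libraries; the sketch below uses scikit-learn's MiniBatchKMeans as one example, with illustrative parameter values and a synthetic placeholder in place of real FC events.

```python
# Mini-batch k-means: centroids are updated from small random subsets of cells,
# which scales better to the millions of events typical of FC files.
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1_000_000, n_features=8, centers=6, random_state=0)

mbk = MiniBatchKMeans(n_clusters=6, batch_size=10_000, n_init=3, random_state=0)
labels = mbk.fit_predict(X)
```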
Another disadvantage of k-means is the need for the user to predefine the number of clusters [18]. Cytometric data are often characterised by a complex structure, which makes it difficult to determine the number of clusters, and choosing the wrong number of clusters can affect the biological interpretation of the results due to both under- and overestimation of their actual number [21]. When using the k-means algorithm, it is important to remember that it assumes the sphericity of the clusters and their separation [18]. These assumptions can be a disadvantage in the case of cytometric data obtained from peripheral blood cell measurements such as peripheral blood mononuclear cells (PBMC). This is because the cluster structure of these data is usually complex. PBMCs include different types of cell populations characterised by irregular distributions in the multidimensional feature space which may be caused, for example, by the fluorescence intensity of different surface markers [22]. At the same time, some cell populations have features that may lead to overlapping clusters causing the assumption of separation to fail.
A number of methods used in attempts to automate cytometric analysis have conceptual similarities to k-means or directly apply this algorithm. These methods include flowClust, flowMerge and flowMeans [23-25]. Methods based on the k-means algorithm are often benchmarked and improved. The flowMerge and flowClust algorithms were found useful in identifying cell populations in real clinical data from patients with chronic lymphocytic leukemia [23]. However, some difficulties in identifying clusters, compared to other methods, were noted when these algorithms were evaluated on synthetic data [24].
Gaussian Mixture Model (GMM)
GMM is a probabilistic model that assumes data are generated from specific probability distributions [36]. It models data as a mixture of Gaussian components, each representing a cluster, and estimates parameters such as means, variances and cluster weights to determine the likelihood of data points belonging to each cluster [37]. This makes GMM suitable for biomedical data analysis, including FC [38]. GMM is effective for clustering multimodal data with unknown cluster numbers, performing well with both continuous and discrete data, particularly when multiple peaks are present [39-40]. Cytometric data, such as PBMC phenotypes, often exhibit multimodality, making GMM well suited to identifying distinct cell populations, even when the differences between them are subtle [41-43]. The algorithm excels with continuous data, and the Dirichlet Process Gaussian Mixture Model, an extension of GMM, handles unknown cluster numbers by automatically detecting clusters based on the data structure [44-45].
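These ideas can be sketched in a few lines of Python. The example below uses scikit-learn's GaussianMixture and its Dirichlet-process-style variant BayesianGaussianMixture as stand-ins for the methods described above; it runs on a synthetic placeholder matrix, and the component counts and other parameters are purely illustrative.

```python
# Gaussian Mixture Model: each cluster is a Gaussian component with its own mean,
# covariance and weight; cells receive soft membership probabilities.
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture, GaussianMixture

X, _ = make_blobs(n_samples=5_000, n_features=8, centers=6, random_state=0)

gmm = GaussianMixture(n_components=6, covariance_type="full", random_state=0)
hard_labels = gmm.fit_predict(X)       # most likely component per cell
soft_probs = gmm.predict_proba(X)      # probability of belonging to each component

# Dirichlet Process-style extension: only an upper bound on the number of
# components is given; superfluous components end up with near-zero weights.
dpgmm = BayesianGaussianMixture(
    n_components=20,
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
)
dp_labels = dpgmm.fit_predict(X)
print(dpgmm.weights_.round(3))         # near-zero weights mark unused components
```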
HDBSCAN
Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) is a density-based clustering algorithm [26-27] that groups closely located points and identifies outliers as noise. Unlike k-means, HDBSCAN does not require a predetermined number of clusters, which makes it useful for analysing cytometric data when the number of subpopulations is unknown [27]. HDBSCAN adapts to different densities and identifies clusters of different shapes, but it requires the definition of a minimum number of points to form a cluster and a distance measure [27]. The algorithm uses a hierarchical approach to clustering, assessing data membership based on position in the cluster tree structure [28]. Points form a cluster if they are sufficiently densely packed, while points that do not meet the density criteria, i.e. without a sufficient number of neighbours within a certain radius, are treated as noise and remain unassigned [29-30].
HDBSCAN starts by calculating the distance of each point to its nearest neighbours, estimating the local density. Based on these distances, it creates a graph in which the points are vertices and the edges are weighted by the distances between the points. This graph is used to build a minimum spanning tree, from which the edges with the highest weights are iteratively removed, splitting the clusters [31]. Small clusters are labelled as noise and larger clusters are given new labels. Finally, the algorithm identifies the most stable clusters as the final result [30, 32]. Despite its efficiency, the HDBSCAN algorithm has a high computational complexity [28-30], making it slower than k-means [30].
Figure 4. The HDBSCAN clustering algorithm
A – scatter plot showing data points distributed across different feature space regions; B – darker colours indicate higher point concentration, suggesting potential clusters; lighter shades indicate lower density, possibly noise; C – graph where points are connected by edges weighted by their distances; D – HDBSCAN-identified clusters, with dense region points forming clusters and others labelled as noise (grey)
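A minimal sketch of this behaviour is shown below, assuming scikit-learn's HDBSCAN implementation (available from scikit-learn 1.3; the standalone hdbscan package exposes a very similar interface). The minimum cluster size and the placeholder data are illustrative only.

```python
# HDBSCAN: no cluster count is specified; instead a minimum cluster size is set,
# and low-density cells are labelled -1 (noise) rather than forced into a cluster.
from sklearn.cluster import HDBSCAN      # scikit-learn >= 1.3; the hdbscan package is similar
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=20_000, n_features=8, centers=6, random_state=0)

clusterer = HDBSCAN(min_cluster_size=200, min_samples=10)
labels = clusterer.fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print(f"{n_clusters} clusters found, {n_noise} cells labelled as noise")
```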
Automated analysis of cytometric data in oncology is crucial for faster and more objective patient monitoring. The combination of uniform manifold approximation and projection (UMAP) and HDBSCAN simultaneously reduces data dimensionality and identifies clusters in acute myeloid leukemia (AML) samples, effectively detecting blasts and improving monitoring of minimal residual disease. This approach outperforms traditional supervised methods, particularly with limited data and high variability of leukemic cells [33]. The HDBSCAN algorithm has also been suggested as useful for the analysis of mitochondrial features by FC [34]. HDBSCAN is used to identify cell populations in some commercial cytometric analysis programmes, although specific implementations vary depending on the tool and the analytical requirements [25, 35].
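The UMAP-plus-HDBSCAN combination can be sketched as follows; this is only an illustration of the general idea, assuming the umap-learn package, and is not the exact pipeline used in the cited AML study.

```python
# Illustrative UMAP + HDBSCAN pipeline: reduce dimensionality first, then cluster
# in the embedded space (not the exact pipeline of the cited AML study).
import umap                               # pip install umap-learn
from sklearn.cluster import HDBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=20_000, n_features=10, centers=8, random_state=0)

embedding = umap.UMAP(n_neighbors=30, min_dist=0.1, n_components=2,
                      random_state=0).fit_transform(X)
labels = HDBSCAN(min_cluster_size=100).fit_predict(embedding)
```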
Classification techniques – supervised machine learning
Random forests
Random forest is a supervised machine learning algorithm developed by Breiman [46]. Its structure can be conceptualised in terms of a natural forest: a random forest is composed of individual trees, each functioning as a classifier. The trees operate in parallel, and the collective classification output is determined by "voting" among them, with the winning class assigned to the input data.
The random forest algorithm is based on decision trees, which represent sets of decisions to solve problems. Decision trees consist of branches and nodes [47]. Key node types include: root (initial division), internal (specific choices), and leaf (final observations). Branches represent decision paths.
Figure 5. Visualization of a decision tree from Random Forests algorithm
The step-by-step classification process is shown through a series of decision nodes (yellow, green) and final classification outcomes at the leaf nodes (blue). The paths illustrate how data is split based on feature thresholds to reach the final decision.
There are several learning algorithms for decision trees, including ID3, C4.5, CART and CHAID [48]. Decision trees are not complex structures; their implementation is relatively straightforward and does not require feature scaling. They demonstrate high precision and accuracy in classification tasks, which is worth considering in the analysis of FC data. On the other hand, they are not suitable for small datasets, which can be a limitation in FC data analysis [49]. Presenting the structure and functioning of decision trees is important in the context of the random forest algorithm because, as mentioned earlier, random forests consist of multiple decision trees. The basic mechanism responsible for generating random forests is bagging (bootstrap aggregation), a method introduced by Breiman [50]. It generates multiple versions of a predictor and then aggregates them. In random forests, bagging creates independent decision trees, each trained on a unique bootstrap sample (a random sample of the data drawn with replacement) [50]. The classification outcome is determined through majority voting, a technique for combining the predictions of multiple classifiers: each decision tree "votes" for a class, and the class with the most votes becomes the final prediction [51-52]. This algorithm can be used as a classifier in cytometric data analysis.
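How bootstrap sampling and majority voting come together in practice can be sketched with scikit-learn's RandomForestClassifier, as below. The marker matrix and the expert labels are hypothetical placeholders, and the parameter values are illustrative.

```python
# Random forest as a cell-level classifier: each tree is trained on a bootstrap
# sample of cells, and the forest's prediction is the majority vote of the trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder data: X holds marker intensities per cell, y holds hypothetical
# expert labels from manual gating (e.g. 0 = normal, 1 = aberrant).
X, y = make_classification(n_samples=20_000, n_features=10, n_informative=5,
                           random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

forest = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
forest.fit(X_train, y_train)
print("accuracy:", forest.score(X_test, y_test))

# Feature importances indicate which markers drive the classification,
# supporting the interpretability argument made below.
print(forest.feature_importances_)
```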
Random forest is an easy-to-implement machine learning algorithm that excels at detecting meaningful data patterns [53]. It can reveal subtle features often overlooked by traditional statistical methods, features that are often crucial in the context of a medical diagnosis [54]. In one study, a random forest model was implemented to identify significant details within the acquired cytometric data, with the objective of increasing diagnostic accuracy. Researchers collected blood samples from 230 individuals, including patients with myelodysplastic syndromes (MDS) and healthy controls, and used FC to evaluate the cellular composition of these samples. A random forest model was then applied to the collected cytometric data, enabling more accurate detection and classification of significant cellular patterns associated with MDS [54]. The ability of random forests to identify subtle relationships in complex, multi-parameter data allowed the researchers to diagnose myelodysplastic syndromes with greater accuracy; the model achieved 92% classification accuracy, a high and satisfactory result. The random forest algorithm is also relatively resistant to overfitting, a situation in which a machine learning model performs well on the data on which it was trained but poorly on new, previously unseen data [55]. FC data can include many features (markers) for each cell, and the number of cells analysed is very large [56]. In such complex datasets, spurious patterns can easily arise and lead to overfitting, particularly when simpler models such as single decision trees are used. Random forests are also relatively easy to interpret [53], which is a major advantage in cytometric data analysis: it helps to understand which features contribute to the classification of different cell populations, making it possible to identify biologically relevant markers [57], which is essential in medical diagnostics and biomedical research. A model that is simple to interpret also makes it easier to verify results, which increases the reliability and precision of analyses.
While the random forest algorithm offers a number of advantages, it is not free from limitations that may affect its effective application in FC data analysis. A significant limitation of machine learning models is their high computational cost [58]. Cytometric datasets are frequently large, complex and multidimensional, so training models on this type of data can be computationally expensive and economically challenging. Although random forests are resistant to overfitting, in some cases they can still be prone to this problem, especially when working with data containing a lot of noise. In the context of FC, noisy data can be interpreted as data affected by measurement errors, sample heterogeneity or biological variability [59-60], all of which can contribute to errors in the classification performed by the model.
Figure 6. Visualisation of the implementation of Random Forests
A – before the application of Random Forests; B – effect of algorithm, classification into three groups
Support Vector Machines (SVM)
SVM was first introduced by Cortes and Vapnik in 1995 [61]. The model was intended to be an effective alternative to neural networks, which were still in development and presented certain technical challenges. SVM involves several mathematical concepts, including hyperplanes, margins, support vectors, kernels and optimization. A hyperplane is a linear decision function that allows separation of data classes from each other; the prefix "hyper" indicates that this plane refers to multiple dimensions, so for n dimensions a hyperplane has (n–1) dimensions [61]. The margin is the distance between the hyperplane and the nearest data point [48]. Support vectors are the data points located closest to the hyperplane, i.e. those that define the margin. Optimization is the process of finding the hyperplane with the largest possible margin, which best separates the data points [62]. A kernel is a mathematical function that transforms data from a lower-dimensional space into a higher-dimensional space, allowing the data to be separated [62]. In the context of FC data, an illustrative example would be the separation of cells in a PBMC plot of FSC and SSC parameters. In the two-dimensional space of the plot, it may not be possible to linearly separate these cells; however, the kernel can be used to map the data points into a three-dimensional space where linear separation becomes possible. The additional dimension allows the creation of a hyperplane that better represents the dataset.
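The kernel idea can be sketched with scikit-learn's SVC, as in the example below: an RBF kernel implicitly maps the cells into a higher-dimensional space in which a separating hyperplane can be found. The data are a synthetic placeholder and the parameter values are illustrative.

```python
# SVM with an RBF kernel: cells that are not linearly separable in the original
# marker space can become separable after the implicit kernel mapping.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data: two "populations" that cannot be split by a straight line.
rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 10))
y = ((X[:, 0] ** 2 + X[:, 1] ** 2) > 2).astype(int)

# class_weight="balanced" partially compensates for under-represented cell types,
# one of the SVM limitations discussed further below.
svm = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=1.0, gamma="scale", class_weight="balanced"),
)
svm.fit(X, y)
print("training accuracy:", svm.score(X, y))
```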
SVMs are particularly effective in classification tasks that require the detailed separation of data. One illustrative example is the use of an SVM as a classification tool for identifying circulating tumour cells (CTCs) in the bloodstream. In one study, blood samples were collected from 41 healthy individuals and 41 patients with colorectal cancer, and CTCs were counted on the basis of the FC results. An SVM classifier based on the number of CTCs was developed and achieved 82.3% accuracy [63]. This demonstrated that applying such cytometric data to SVM training can facilitate effective differentiation between healthy and cancerous blood samples, and the high performance of the model suggests its potential future use as a non-invasive cancer screening tool. SVM models are also suitable for identifying rare cells in peripheral blood. In one study, an SVM model developed to identify rare cell types in FC data achieved an accuracy of 69% compared to traditional manual classification. This tool could be used in the future for more precise analysis of FC data, particularly in the identification of rare cell types, which may be important in both disease diagnosis and therapy monitoring [64].
Despite their benefits, SVMs have limitations, particularly with data imbalance. FC datasets often contain underrepresented cell types, leading SVMs to create hyperplanes biased towards majority types, which may not be optimal for less abundant cell types [65]. Cytometric data are typically multidimensional and complex, necessitating careful kernel selection and parameter tuning to optimize model performance. This process, though time-consuming, is crucial to prevent over- or under-fitting [66-69]. Additionally, SVMs’ computational complexity in high-dimensional spaces can be a drawback, particularly in large cytometric datasets where rapid analysis is required [61, 70].
Deep learning
Although not new, deep learning has rapidly advanced due to increased computing power [71]. It is widely applied in medical fields, including immunology [72]. In simplified terms, neural networks learn through a two-stage process. First, the network receives a substantial amount of data, which it uses to attempt to predict an outcome; it then measures the difference between the predicted outcome and the expected one. This is an iterative process in which prediction accuracy is increased by adjusting the weights, with each iteration improving the resulting predictions [73]. The actual learning process of neural networks is inherently complex, relying on several mathematical and statistical principles that require a deep understanding of linear algebra, calculus, probability and mathematical optimization. However, it is possible to explain this process in much simpler terms using one of the simplest models: a single-layer neural network.
To illustrate this, we can take the example of a manually curated FC dataset distinguishing healthy and cancerous cells. A single-layer neural network with an input layer and an output layer connected by randomly initialized weights is used. FC data, representing cell features, are entered, weighted, summed and passed through an activation function to capture patterns. The network predicts the cell’s type, and the error between the prediction and the true class is calculated. Using backpropagation, weights are adjusted iteratively to minimize the error [74-76].
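The learning loop described above can be written out explicitly in a few lines of NumPy. The sketch below is a deliberately simplified single-layer network with a sigmoid activation, trained on synthetic placeholder data standing in for labelled FC measurements; it is an illustration, not a production model.

```python
# Single-layer network trained with gradient descent: weighted sum -> sigmoid ->
# prediction -> error -> weight update (the iterative loop described above).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_cells, n_markers = 1_000, 8
X = rng.normal(size=(n_cells, n_markers))        # placeholder marker intensities
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)  # synthetic "healthy vs. cancer" labels

w = rng.normal(scale=0.01, size=n_markers)       # randomly initialised weights
b = 0.0
lr = 0.1

for epoch in range(200):
    y_hat = sigmoid(X @ w + b)                   # weighted sum + activation
    error = y_hat - y                            # difference from the true class
    w -= lr * X.T @ error / n_cells              # gradient step on the weights
    b -= lr * error.mean()

accuracy = ((sigmoid(X @ w + b) > 0.5) == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```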
In FC, deep learning enhances diagnostic efficiency by reducing analysis time and improving feature extraction [54]. Recent advancements have broadened its applications, even in challenging areas [77]. For example, a deep learning model effectively detected rare tumor cell clusters in breast cancer biopsies, though it showed lower sensitivity, indicating the need for larger datasets [78]. Deep learning models have achieved high efficiency in AML diagnosis, distinguishing AML from acute lymphoblastic leukemia with near-perfect accuracy [79]. Neural networks have also proven effective in analysing multi-parameter flow cytometry data, aiding in the classification of leukemias [80-81]. Automation of pattern detection through neural networks significantly improves the precise classification of leukemic subtypes.
Deep learning in FC relies heavily on large training datasets, which may be challenging to obtain in smaller cytometric studies [78, 82]. Training complex neural networks also demands advanced hardware [71]. An additional drawback is the so-called “black box problem” which refers to the difficulty in explaining how a neural network makes specific decisions and generates the final outcome of its predictions [83]. In the context of cytometric data analysis, it is often unclear which specific cellular characteristics influenced the model to make certain diagnostic decisions. Simpler machine learning algorithms might sometimes offer more efficient solutions in this context.
Dimensionality reduction
Dimensionality reduction is an important group of machine learning methods, particularly in the analysis of data with many variables. It is the process of simplifying a dataset by reducing the number of variables while retaining as much relevant information as possible [84]. This is the method's biggest advantage, as a large number of variables often leads to model over-fitting, a consequence of the so-called 'curse of dimensionality' [85-86]. A second advantage is data compression and faster computation [86]. The most commonly used methods for dimensionality reduction are principal component analysis (PCA), independent component analysis (ICA), t-distributed stochastic neighbour embedding (t-SNE) and the previously mentioned UMAP [87-89]. For cytometric data, the t-SNE algorithm is often implemented in commercial software.
t-SNE is a non-linear, unsupervised technique mainly used for the exploration and visualisation of multivariate data [90]. It is a stochastic method, ordering ‘neighbours’ while preserving local data structures and using the Student’s t-distribution to model distances in low-dimensional space [91]. The algorithm allows separation of data that cannot be separated by a straight line, which is important for cytometric data.
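A typical application of t-SNE to cytometric data can be sketched with scikit-learn, as below; the subsample size and perplexity are illustrative, and, as discussed further on, such parameters generally require tuning.

```python
# t-SNE embedding of a subsample of cells for 2-D visualisation; t-SNE scales
# poorly, so FC data are usually subsampled before embedding.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

X, _ = make_blobs(n_samples=50_000, n_features=10, centers=8, random_state=0)

rng = np.random.default_rng(0)
subsample = X[rng.choice(len(X), size=5_000, replace=False)]

embedding = TSNE(n_components=2, perplexity=30, init="pca",
                 random_state=0).fit_transform(subsample)
# `embedding` has shape (5_000, 2) and can be plotted and coloured by marker
# intensity or by cluster labels from one of the clustering algorithms above.
```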
t-SNE is a valuable tool in cell biology and immunological research, e.g. for profiling cells of the immune system to understand their diversity, function and role in the immune response. The t-SNE algorithm has enabled the identification of subpopulations of normal and leukemic lymphocytes and the evaluation of their expression of immunosuppressive markers, clearly separating them from normal haematopoietic cells [92-93]. Combining t-SNE with unsupervised learning algorithms enables analysis of cytometric data to detect residual disease with high sensitivity [94]. The algorithm has also supported the analysis of PBMC multicolour FC data, identifying rare subgroups of vaccine-induced T and B cells [95].
Dimensionality reduction using t-SNE is effective for visualising immune cells and quantifying their frequencies, showing high agreement with conventional manual gating. However, it may not fully separate specific subsets of immune cells, leading to some discrepancies in their identification and quantification. For this reason, the need to adapt the algorithm to cytometric data is emphasised, as standard parameter settings may produce inaccurate or misleading cell maps [96-97]. The disadvantages of t-SNE are its high computational cost, difficult interpretation and parameters that require tuning and experimentation.
Conclusions
Machine learning algorithms enable automated and precise data analysis, reducing errors due to subjectivity. However, challenges remain. A key one is standardisation, needed to ensure reproducibility and reliability. Standardisation is essential in laboratory diagnostics and the biomedical sciences, as it enables comparison of results between laboratories and supports the introduction of modern techniques into routine diagnostics and clinical research; without it, these methods are unlikely to be accepted in clinical practice. Other challenges include high computational costs, since cytometric analyses involve large, multidimensional datasets that require adequate memory and computing power. These costs can be reduced by optimising the performance of the algorithms and by using cloud-based solutions. As the computational complexity of the algorithms plays a key role here, a more practical selection of algorithms for specific cytometry applications also seems necessary.
Conflict of interest
None.
Funding
None.