Interactive Platform for Synthetic Flow Cytometry Data Generation for Education

This research project was conducted as part of my M.Sc. in Data Science at the Robert Gordon University in Scotland, UK. The project poster and abstract presented here provides a brief overview of the project background, methodology and implementation.

Flow cytometry (FCM) is a crucial technology in biomedical research, enabling high-dimensional analysis of cellular properties across fields such as immunology, oncology, and regenerative medicine. However, its educational application is hindered by challenges in data accessibility, privacy concerns, and the complexity of real-world datasets, which often contain technical noise and biological variability. These factors create steep learning curves for beginners, necessitating cleaner, customizable datasets to teach core analytical concepts like gating and clustering. While synthetic data generation offers a promising solution, existing tools lack user-friendly interfaces for dynamically simulating biologically plausible cell populations.

This project addresses this gap by developing an interactive web-based platform for generating synthetic FCM data tailored to educational needs. The platform comprises two complementary applications built using Streamlit and Python. App-1 enables synthetic data generation "from scratch" via user-defined population characteristics, leveraging Gaussian distributions to create idealized datasets with adjustable parameters, class imbalance correction, and watermarking for traceability. App-2 extracts populations from experimental FCM files, employing Gaussian Mixture Models (GMM) and dimensionality reduction (UMAP/t-SNE) to replicate complex biological structures while allowing dynamic resampling and visualization. Both apps support real-time adjustments, export synthetic data in standard FCS/CSV formats, while App-2 integrate validation metrics (Silhouette Score, Mutual Information, Wasserstein Distance) to evaluate the quality of the generated synthetic data against the real data.

The platform successfully bridges educational needs by providing educators with a user-friendly platform to generate reproducible, noise-free synthetic FCM data files mimicking real world biological scenarios. The interactive platform has the potential to be enhanced to provide researchers with annotated synthetic datasets to benchmark algorithm validation. Key innovations include interactive cell population editing, configuration file sharing for reproducibility, and tools to simulate diverse experimental scenarios. While the reliance on Gaussian models in App-1 simplifies biological realism, the integration of GMM in App-2 ensures statistical rigor for advanced use cases. Future work could enhance biological fidelity through non-Gaussian distributions and expanded metadata support. This project underscores the transformative potential of synthetic data in democratizing FCM education, fostering standardized training, and accelerating methodological developments in cytometry analysis.