Introduction

Gene regulation is a dynamic process largely determined by the physical access of chromatin-binding factors such as transcription factors (TFs) to regulatory regions of the DNA (e.g., enhancers and promoters) (Klemm, Shipony, and Greenleaf 2019). The genome-wide landscape of chromatin accessibility is essential in the control of cellular identity and cell fate and thus varies in different cell types (K. Zhang et al. 2021; Klemm, Shipony, and Greenleaf 2019). Over the last decade, Assay for Transposase-Accessible Chromatin (ATAC-Seq) (Buenrostro et al. 2013) has become a reference epigenomic technique to profile chromatin accessibility and the activity of gene regulatory elements in diverse biological contexts including cancer (Luo, Gribskov, and Wang 2022) and across large cohorts (Corces et al. 2018). Several optimized ATAC-seq protocols have been developed to improve the quality of ATAC-Seq data and expand its usage to different tissue types. These include the OMNI-ATAC protocol, which leads to cleaner signal and is applicable to frozen samples (Corces et al. 2017; Grandi et al. 2022), as well as the formalin-fixed paraffin-embedded (FFPE)-ATAC protocol adapted to FFPE samples. The reasonable cost and technical advantages of these protocols foreshadow an increased usage of ATAC-Seq in cancer studies.

Most biological tissues are composed of multiple cell types. For instance, tumors are complex ecosystems including malignant and stromal cells as well as a large diversity of immune cells. This cellular heterogeneity, in particular the presence of specific immune cell types, impacts tumor progression as well as response to immunotherapy (Fridman et al. 2012; 2017; de Visser and Joyce 2023). Most existing ATAC-Seq data from tumors were performed on bulk samples, thereby including information from both cancer and non-malignant cells. Precisely quantifying the proportions of different cell types in such samples represents therefore a promising way to explore the immune contexture and the composition of the tumor micro-environment (TME) across large cohorts. Carefully assessing cell-type heterogeneity is also important to handle confounding factors in genomic analyses in which samples with different cellular compositions are compared. Recently, single-cell ATAC-Seq (scATAC-Seq) has been developed to explore cellular heterogeneity with high resolution in complex biological systems (Cusanovich et al. 2015; Lareau et al. 2019; Satpathy et al. 2019). However, the resulting data are sensitive to technical noise and such experiments require important resources, which so far limits the use of scATAC-Seq in contrast to bulk sequencing in the context of large cohorts.

In the past decade, computational deconvolution tools have been developed to predict the proportion of diverse cell types from bulk genomic data obtained from tumor samples (Avila Cobos et al. 2018; 2020; Sturm et al. 2019; Racle et al. 2017; Monaco et al. 2019; Newman et al. 2019; H. Li et al. 2020; Finotello et al. 2019; Becht et al. 2016). A large number of these tools model bulk data as a mixture of reference profiles identified in purified cell populations for each cell type. The accuracy of the predictions of cell-type proportions relies on the quality of these reference profiles as well as on the use of cell-type specific markers (Avila Cobos et al. 2018). A limitation of most deconvolution algorithms is that they do not predict the proportion of cell types that are not present in the reference profiles (here referred to as ‘uncharacterized’ cells). In the context of cancer samples, these uncharacterized cell populations include malignant cells whose molecular profiles differ not only from one cancer type to another, but also from one patient to another even within the same tumor type (Corces et al. 2018). A few tools consider uncharacterized cells in their deconvolution framework by using cell-type specific markers not expressed in the uncharacterized cells (Clarke, Seol, and Clarke 2010; Gosink, Petrie, and Tsinoremas 2007; Racle et al. 2017; Finotello et al. 2019). These tools include EPIC (Estimating the Proportion of Immune and Cancer cells) which simultaneously quantifies immune, stromal, vascular as well as uncharacterized cells from bulk tumor samples (Racle et al. 2017; Racle and Gfeller 2020).

Most deconvolution algorithms have been initially developed for transcriptomic data (RNA-Seq data) (Newman et al. 2015; Racle et al. 2017; Finotello et al. 2019; Monaco et al. 2019; Newman et al. 2019; T. Li et al. 2020; Jimenez-Sanchez, Cast, and Miller 2019; Gong and Szustakowski 2013). More recently they have been adapted for other omics layers such as methylation (Chakravarthy et al. 2018; Teschendorff et al. 2020; Arneson, Yang, and Wang 2020; H. Zhang et al. 2021) and proteomics (Feng et al. 2023) or chromatin accessibility. For the latter, a specific framework called DeconPeaker (H. Li et al. 2020) was developed to estimate cell-type proportions from bulk samples. Deconvolution tools developed initially for other omics modalities, such as RNA-Seq, can also be applied on ATAC-Seq if appropriate ATAC-Seq profiles are provided to the tool. For example, the popular deconvolution tool, CIBERSORT (Newman et al. 2015), was used to deconvolve leukemic ATAC-Seq samples (Corces et al. 2016). Other methods have been proposed to decompose ATAC-Seq bulk profiles into subpopulation-specific profiles (Zeng et al. 2019; Burdziak et al. 2019) or compartments (Peng et al. 2019). However, these methods have more requisites: (i) the integration of the ATAC-Seq data with single-cell or bulk RNA-Seq (Zeng et al. 2019; Burdziak et al. 2019) and HIChIP data (Zeng et al. 2019) or, (ii) subsequent feature annotation to associate compartments with cell types or biological processes (Peng et al. 2019).

The application of existing bulk ATAC-Seq data deconvolution tools to solid tumors is limited. First, current computational frameworks do not quantify populations of uncharacterized cell types. Second, ATAC-Seq based markers (i.e., chromatin accessible regions called peaks) and reference profiles generated so far have been derived in the context of hematopoietic cell mixtures (Corces et al. 2016; H. Li et al. 2020). Markers and profiles for major populations of the TME (e.g., stromal and vascular cells) are thus missing. While cell-type specific markers have been identified from scATAC-Seq data (K. Zhang et al. 2021), not all TME-relevant cell types are covered (e.g., lack of scATAC-Seq data from neutrophils due to extracellular traps formation). Also, these markers have not been curated to fulfill the requirements of tools such as EPIC to quantify uncharacterized cells (i.e., markers of a cell-type should not be accessible in other human tissues).

In this study, we collected ATAC-Seq data from pure cell types to identify cell-type specific marker peaks and to build reference profiles from most major non-malignant cell types typically observed in tumors. These data were integrated in the EPIC (Racle et al. 2017) framework to perform bulk ATAC-Seq samples deconvolution (Figure 1). Applied on peripheral blood mononuclear cells (PBMCs) and tumor samples, the EPIC-ATAC framework showed accurate predictions of the proportions of non-malignant and malignant cells with similar or higher performances than other existing tools.

Graphical description of the identification of cell-type specific marker peaks and reference ATAC-Seq profiles included in the EPIC-ATAC framework. 1) 564 pure ATAC-Seq data of sorted cells were collected to build reference profiles for cancer-relevant cell populations. 2) Cell-type specific marker peaks were identified using differential accessibility analysis. 3) Markers with previously observed chromatin accessibility in human healthy tissues were then excluded. 4) For tumor bulk deconvolution, the set of remaining marker peaks was refined by selecting markers with correlated behavior in tumor bulk samples. 5) The cell-type specific marker peaks and reference profiles were finally integrated in the EPIC-ATAC framework to perform bulk ATAC-Seq deconvolution. Parts of this figure were created with BioRender.com.

Results

ATAC-Seq data from sorted cell populations reveal cell-type specific marker peaks and reference profiles

A key determinant for accurate predictions of cell-type proportions by most deconvolution tools is the availability of reliable cell-type specific markers and reference profiles. To identify robust chromatin accessibility marker peaks of cancer relevant cell types, we collected 564 samples of sorted cell populations from twelve studies including eight immune cell types (B cells (Calderon et al. 2019; Corces et al. 2016; P. Zhang et al. 2022), CD4+ T cells (Corces et al. 2016; Liu et al. 2020; P. Zhang et al. 2022; Mumbach et al. 2017; Giles et al. 2022), CD8+ T cells (Calderon et al. 2019; Corces et al. 2016; Liu et al. 2020; P. Zhang et al. 2022; Giles et al. 2022), natural killer (NK) cells (Calderon et al. 2019; Corces et al. 2016), dendritic cells (DCs) (Calderon et al. 2019; Leylek et al. 2020; Liu et al. 2020), macrophages (Liu et al. 2020; P. Zhang et al. 2022), monocytes (Calderon et al. 2019; Corces et al. 2016; Leylek et al. 2020; P. Zhang et al. 2022; Trizzino et al. 2021) and neutrophils (Ram-Mohan et al. 2021; Perez et al. 2020), as well as fibroblasts (Ge et al. 2021; Liu et al. 2020) and endothelial (Liu et al. 2020; Xin et al. 2020) cells (Figure 1 box 1, Figure 2A, Supplementary Table 1). To limit batch effects, the collected samples were homogeneously processed from read alignment to peak calling. For each cell type, we derived a set of stable peaks, i.e., peaks observed across samples and studies (see Materials and Methods).

ATAC-Seq data from sorted cell populations reveal cell-type specific marker peaks and reference profiles.

A) Number of samples collected for each cell type. The colors correspond to the different studies of origin. B) Representation of the collected samples in 2D using UMAP based on the PBMC markers (left) and TME markers (right). Colors correspond to cell types. C) Scaled averaged chromatin accessibility of the cell-type specific marker peaks (rows) in each cell type (columns) in the ATAC-Seq reference samples used to identify the marker peaks. D) Scaled averaged chromatin accessibility of the marker peaks in external ATAC-Seq data from samples of pure cell types excluded from the reference samples (see Material and Methods). E) Scaled averaged chromatin accessibility of the marker peaks in an external scATAC-Seq dataset (Human Atlas (K. Zhang et al. 2021)). F) Distribution of the marker peak distances to the nearest transcription start site (TSS) (left panel) and the ChiPSeeker annotations (right panel). G) Significance (-log10(q.value)) of pathways (columns) enrichment test obtained using ChIP-Enrich on each set of cell-type specific marker peaks (rows). A subset of relevant enriched pathways is represented. Colors of the names of the pathways correspond to cell types where the pathways were found to be enriched. When pathways were significantly enriched in more than one set of peaks, pathways names are written in bold.

These peaks were then used to perform pairwise differential analysis to identify marker peaks for each cell type (Figure 1, box 2). To ensure that the cell-type specific marker peaks are not accessible in other human tissues, we included in the differential analysis ATAC-Seq samples from diverse human tissues from the ENCODE data (The ENCODE Project Consortium et al. 2020; Rozowsky et al. 2023) (Supplementary Figure 1). To select a sufficient number of peaks prior to peak filtering, the top 200 peaks recurrently differentially accessible across all cell-type pairs were selected as cell-type specific markers (see Materials and Methods). Using the human atlas study (K. Zhang et al. 2021), markers with potential residual accessibility in human tissues were then filtered out (Figure 1, box 3, see Materials and Methods). The resulting marker peaks specific to the immune cell types were considered for the deconvolution of PBMC samples (PBMC markers). For tumor bulk sample deconvolution, the list of markers was further refined based on the correlation patterns of the markers in tumor bulk samples from diverse cancer types from The Cancer Genome Atlas (TCGA) (Corces et al. 2018) (Figure 1, box 4, see the Material and methods). The latter filtering ensures the relevance of the markers in the TME context since cell-type specific TME markers are expected to be correlated in tumor bulk ATAC-Seq measurements (Qiu et al. 2021). 716 markers of immune, fibroblasts and endothelial cell types remained after the later filtering and were considered for the deconvolution of bulk tumor samples (TME markers).

To assess the quality and reproducibility of these markers, we performed principal component analysis (PCA) based on each set of marker peaks. Computing silhouette coefficients based on the cell-type classification and on the study of origin showed that samples clustered by cell type and not by study of origin (averaged silhouette coefficients above 0.45 for cell type and around 0 for study of origin). Two-dimensional UMAP representations of the samples confirmed this observation (Figure 2B). These results indicate limited remaining batch effects after data processing and marker selection.

We then used the collected samples to generate chromatin accessibility profiles by computing the average of the normalized counts for each peak in each cell type as well as peak variability in each cell type (Racle et al. 2017) (see Material and methods). Figure 2C represents the average chromatin accessibility of each marker peak in each cell type of the reference dataset and highlights, as expected, the cell-type specificity of the selected markers (see also Supplementary Tables 2 and 3), which was confirmed in independent ATAC-Seq data from sorted cells and single-cell ATAC-Seq samples from blood and diverse human tissues (Figure 2D and 2E, see Materials and methods).

Annotations of the marker peaks highlight their biological relevance

To characterize the different marker peaks, we annotated them using ChiPSeeker (Yu, Wang, and He 2015). We observed that most of the markers are in distal and intergenic regions (Figure 2F), which is expected considering the large proportion of distal regions in the human genome and the fact that such regions have been previously described as highly cell-type specific (Corces et al. 2016). We also noticed that 7% of the PBMC and TME marker peaks are in promoter regions in contrast to 4% when considering matched genomic regions randomly selected in the set of peaks identified prior to the differential analysis (see Material and methods), which suggest enrichment in our marker peaks for important regulatory regions.

To assess the biological relevance of the marker peaks, we associated each marker peak to its nearest gene using ChIP-Enrich based on the “nearest transcription start site (TSS)” locus definition (Welch et al. 2014) (Supplementary Tables 4 and 5). Nearest genes reported as known marker genes in public databases of gene markers (i.e., PanglaoDB (Franzén, Gan, and Björkegren 2019) and CellMarker (Hu et al. 2023)) are listed in Table 1.

List of nearest genes and enriched CBPs reported in the PanglaoDB or CellMarker databases.

In each set of cell-type specific peaks, we observed an overrepresentation of chromatin binding proteins (CBPs) reported in the JASPAR2022 database (Castro-Mondragon et al. 2022) (using Signac (Stuart et al. 2021) and MonaLisa (Machlab et al. 2022) for assessing the overrepresentation) and the ReMap catalog (Hammal et al. 2022) (using RemapEnrich, see Material and Methods). Overrepresented CBPs also reported as known marker genes in the PanglaoDB and CellMarker databases are listed in Table 1. Detailed peaks annotations are summarized in Supplementary Tables 4 and 5.

Based on the “nearest TSS” annotation, we tested, using ChIP-Enrich (Welch et al. 2014), whether each set of cell-type specific marker peaks was enriched for regions linked to specific biological pathways (GO pathways). Figure 2G highlights a subset of the enriched pathways that are consistent with prior knowledge on each cell type. Some of these pathways are known to be characteristic of immune responses to inflammatory or tumoral environments. The complete list of enriched pathways is listed in the Supplementary Tables 6 and 7. Overall, these analyses demonstrate that the proposed cell-type specific marker peaks capture some of the known biological properties associated to each cell type.

EPIC-ATAC accurately estimates immune cell fractions in PBMC ATAC-Seq samples

The cell-type specific marker peaks and profiles derived from the reference samples were integrated to the EPIC deconvolution tool (Racle et al. 2017; Racle and Gfeller 2020). We will refer to this ATAC-Seq deconvolution framework as EPIC-ATAC. To test the accuracy of EPIC-ATAC predictions, we first collected PBMCs from five healthy donors. In each donor, half of the cells was used to generate a bulk ATAC-Seq dataset and the other half was used to determine the cellular composition of each sample, i.e., the proportions of monocytes, B cells, CD4+ T cells, CD8+ T cells, NK cells and dendritic cells, by multiparametric flow cytometry (Figure 3A, see Materials and methods). We then applied EPIC-ATAC to the bulk ATAC-Seq data. The predicted cell fractions are consistent with the cell fractions obtained by flow cytometry (Figure 3B, Pearson correlation coefficient of 0.78 and root mean squared error (RMSE) of 0.10).

EPIC-ATAC accurately estimates immune cell fractions in PBMC ATAC-Seq samples.

A) Schematic description of the experiment designed to validate the ATAC-Seq deconvolution on PBMC samples. B) Comparison between cell-type proportions predicted by EPIC-ATAC and the true proportions in the PBMC bulk dataset. Symbols correspond to donors. C) Comparison between the proportions of cell-types predicted by EPIC-ATAC and the true proportions in the PBMC pseudobulk dataset. Symbols correspond to pseudobulks. D) Pearson correlation (left) and RMSE (right) values obtained by each deconvolution tool on the PBMC bulk dataset. The EPIC-ATAC results are highlighted in red. E) Pearson correlation (left) and RMSE (right) values obtained by each deconvolution tool on the PBMC pseudobulk dataset. Parts of this figure (panel 1) were created with BioRender.com.

As a second validation, we applied EPIC-ATAC to pseudo-bulk PBMC samples (referred to as the PBMC pseudobulk dataset, generated using three publicly available PBMC scATAC-Seq datasets (Satpathy et al. 2019; Granja et al. 2019; 10x Genomics 2021), see Material and methods). A high correlation (0.91) between EPIC-ATAC predictions and true cell-type proportions and a low RMSE (0.05) were observed for this dataset (Figure 3C).

The accuracy of the predictions obtained with EPIC-ATAC was then compared with the accuracy of other deconvolution approaches which could be used with our reference profiles and marker peaks (Figure 3D-E). To this end, we considered both the DeconPeaker method (H. Li et al. 2020) originally developed for bulk ATAC-Seq as well as several algorithms developed for bulk RNA-Seq (CIBERSORTx (Newman et al. 2019), QuanTIseq (Finotello et al. 2019), ABIS (Monaco et al. 2019), and MCPcounter (Becht et al. 2016)). To enable meaningful comparison across the cell types considered in this work and use the method initially developed for bulk RNA-Seq deconvolution, the marker peaks and profiles derived in this work were used in each of these methods. DeconPeaker and CIBERSORTx include the option to define cell-type specific markers and profiles from a set of reference samples. We thus fed our ATAC-Seq samples collection to both algorithms and used the resulting profiles and marker peaks to perform bulk ATAC-Seq deconvolution. The resulting predictions are referred to as DeconPeaker-Custom and CIBERSORTx-Custom.

Many tools displayed high correlation and low RMSE values, similar to those of EPIC-ATAC, and no single tool consistently outperformed the others (Figure 3D-E, Supplementary Figure 2A-C). The fact that our marker peaks and reference profiles could be used with EPIC-ATAC and other existing tools demonstrates their broad applicability.

Predictions accuracies were also evaluated in each cell type separately. Since the number of samples was low in each dataset, samples from both datasets were combined for this analysis. EPIC-ATAC demonstrated good accuracies across cell types with RMSE values ranging from 0.02 for B cells to 0.13 for NK cells (Supplementary Figure 3). As expected, predictions with all tools were more accurate for frequent cell types with well-characterized markers (e.g., CD8/CD4 T cells, B cells) compared to less frequent cell types (e.g., NK cells, dendritic cells) (Supplementary Figure 2 and 3). Note that MCPcounter is a marker-based method that derives cell-type specific scores which cannot be compared between cell types. This method was thus only included in the benchmark considering each cell type separately.

EPIC-ATAC accurately predicts fractions of cancer and non-malignant cells in tumor samples

We evaluated the ability of the EPIC-ATAC framework to predict not only immune and stromal cells proportions but also the proportion of cells for which reference profiles are not available (i.e., uncharacterized cells). For this purpose, we considered two previously published scATAC-Seq datasets containing basal cell carcinoma and gynecological cancer samples (Satpathy et al. 2019; Regner et al. 2021). We generated two pseudobulk datasets by averaging the chromatin accessibility signal across all cells of each sample (see Material and methods). Applying EPIC-ATAC to both datasets shows that this framework is able to simultaneously predict the proportions of both uncharacterized cells and immune, stromal and vascular cells (Figure 4A). In these cancer samples, the proportion of uncharacterized cells can be seen as a proxy of the proportion of cancer cells.

EPIC-ATAC accurately predicts fractions of cancer and non-malignant cells in tumor samples.

A) Comparison between cell-type proportions estimated by EPIC-ATAC and true proportions for the basal cell carcinoma (top) and gynecological (bottom) pseudobulk datasets. Symbols correspond to pseudobulks. B) Pearson’s correlation and RMSE values obtained for the deconvolution tools included in the benchmark. EPIC-ATAC is highlighted in red. C) Same analyses as in panels B, with the uncharacterized cell population excluded for the evaluation of the predictions accuracy. The predicted and true proportions of the immune, stromal and vascular cell types were rescaled to sum to 1.

As for the PBMC datasets, we compared EPIC-ATAC performances to other existing deconvolution tools. For both datasets, EPIC-ATAC led to the highest performances and was the only method to accurately predict the proportion of uncharacterized cells (Figure 4B, Supplementary Figure 4 and 5). Although quanTIseq also allows users to perform such predictions, the method resulted in lower correlation and higher RMSE values when comparing the estimated and true proportions of the uncharacterized cells (Figure 4B, Supplementary Figure 4).

In the EPIC-ATAC and quanTIseq frameworks, predictions correspond to absolute cell-type fraction, i.e., proportions of all cells present in the bulk, while the estimations obtained from the other tools correspond to relative cell fractions, i.e., proportions of cells present in the reference profiles (CIBERSORTx, DeconPeaker) or to scores with arbitrary units (ABIS, MCPcounter). We thus conducted a second benchmark excluding the predictions of uncharacterized cell fractions and rescaling both estimations and true proportions to sum to 1 (see Material and methods). EPIC-ATAC outperformed most of the other methods also when excluding the uncharacterized cells (Figure 4C, Supplementary Figure 4 and 5). Supplementary Figure 6 reports the performances of each tool when considering each cell type separately. Overall, EPIC-ATAC showed comparable or higher correlation and lower RMSE values when compared to the other deconvolution tools.

T cell subtypes quantification reveals the ATAC-Seq deconvolution limits for closely related cell types

To explore the limitations of ATAC-Seq deconvolution, we next evaluated whether EPIC-ATAC could predict the proportions of T-cell subtypes. To this end, we considered naive and non-naive CD8+ as well as naïve, helper/memory and T regulatory CD4+ T cells. We redefined our list of cell-type specific marker peaks and reference profiles including also these five T-cell subtypes (Supplementary Tables 8-9, Supplementary Figure 7A) and observed that the markers were conserved in external data (Supplementary Figure 7B). The annotations of the markers associated to the T-cell subtypes are available in Supplementary Tables 10-13.

We capitalized on the more detailed cell-type annotation of the PBMC datasets as well as the basal cell carcinoma dataset to evaluate the EPIC-ATAC prediction of cell-subtype fractions using these updated markers and profiles. Overall, the correlations observed between the predictions and true proportions of T cells decreased when considering T-cell subtypes rather than CD4+ and CD8+ cell types only (Figure 5A). In particular, low accuracies were obtained for helper/memory CD4+ and naïve T-cell subtypes (Figure 5B). Similar results were obtained using other deconvolution tools (Supplementary Figure 8).

T cell subtypes quantification reveals the ATAC-Seq deconvolution limits for closely related cell types.

A) Comparison of the proportions estimated by EPIC-ATAC and the true proportions for PBMC samples (PBMC experiment and PBMC pseudobulk samples combined) (top) and the basal cell carcinoma pseudobulks (bottom). Predictions of the proportions of CD4+ and CD8+ T-cells were obtained using the reference profiles based on the major cell types and subtype predictions using the reference profiles including the T-cell subtypes. B) Pearson’s correlation values obtained by EPIC-ATAC in each cell type.

EPIC-ATAC accurately infers the immune contexture in a bulk ATAC-Seq breast cancer cohort

We applied EPIC-ATAC to a breast cancer cohort of 42 breast ATAC-Seq samples including samples from two breast cancer subtypes, i.e., 35 oestrogen receptor (ER)-positive human epidermal growth factor receptor 2 (HER2)-negative (ER+/HER2-) samples and 7 triple negative (TN) tumors (Kumegawa et al. 2023). No cell sorting was performed in parallel to the chromatin accessibility sequencing. We thus used EPIC-ATAC to estimate cell-type proportions. We observed a higher proportion of T cells, B cells, NK cells and macrophages in the TN samples in comparison to ER+/HER2-samples (Figure 6A).

EPIC-ATAC accurately infers the immune contexture in a bulk ATAC-Seq breast cancer cohort.

A) Proportions of different cell types predicted by EPIC-ATAC in the samples stratified based on two breast cancer subtypes. B) Proportions of different cell types predicted by EPIC-ATAC in the samples stratified based on three ER+/HER2-subgroups. Wilcoxon test p-values are represented at the top of the boxplots.

We then compared the cellular composition of ER+/HER2-subgroups identified in the original study (clusters CA-A, CA-B and CA-C). A higher infiltration of T and B cells was observed in cluster CA-C and higher proportions of endothelial cells and fibroblasts were observed in cluster CA-B (Figure 6B). These predictions are consistent with the infiltration level estimations reported in the original publication, although no differences in macrophages infiltration was observed between the ER+/HER2-subgroups in our case (Kumegawa et al. 2023).

EPIC-ATAC performs similarly to EPIC RNA-seq based deconvolution and better than gene activity based deconvolution

We finally compared the accuracy of EPIC when applied on ATAC-Seq data and on RNA-Seq data. For this purpose, we used the 10X multiome PBMC dataset (10x Genomics 2021) which provides for each cell both its chromatin accessibility profile and its gene expression profile and simulated 100 pseudobulks with diverse cellular compositions (see Material and methods). We used EPIC-ATAC to perform ATAC-Seq based deconvolution on the chromatin accessibility levels of the peaks and the original EPIC tool to perform standard RNA-seq deconvolution on the gene expression levels. ATAC-Seq peaks can also be aggregated, based on peak distances to each gene, into gene activity (GA) variables as proxy for gene expression. We thus applied the GA transformation to the 10x multiome PBMC dataset and performed GA-based RNA deconvolution using the original EPIC tool (See Material and methods). Figure 7 shows that EPIC-ATAC performs similarly to the EPIC RNA-seq based deconvolution and outperforms the GA-based RNA deconvolution. The lower performances of GA based RNA deconvolution could be explained by the fact that GA features, by construction, do not perfectly match the transcriptomic data.

EPIC-ATAC performs similarly to EPIC RNA-seq based deconvolution and better than gene activity based deconvolution.

Pearson’s correlation (left) and RMSE (right) values comparing the proportions predicted by the ATAC-Seq deconvolution, the RNA-Seq deconvolution and the GA-based RNA deconvolution and true cell-type proportions in the 100 pseudobulks simulated form the 10x multiome PBMC dataset (10x Genomics 2021). Dots correspond to outlier pseudobulks.

Discussion

Bulk chromatin accessibility profiling of biological tissues like tumors represents a reliable and affordable technology to map the activity of gene regulatory elements across multiple samples in different conditions. Here, we collected ATAC-Seq data from pure cell populations covering major immune and non-immune cancer-relevant cell types from diseased, stimulated and healthy samples. This enabled us to identify reliable cell-type specific marker peaks and chromatin accessibility profiles for both PBMC and solid tumor sample deconvolution. We integrated these data in the EPIC deconvolution framework to accurately predict the fraction of both malignant and non-malignant cell types from bulk tumor ATAC-Seq samples.

In cases where specific cell types are expected in a sample but are not part of our list of reference profiles (e.g., neuronal cells in brain tumors), custom marker peaks and reference profiles can be provided to EPIC-ATAC to perform cell-type deconvolution and we provide the code to generate such markers and profiles based on ATAC-Seq data from sorted cells, following the approach developed in this work (Figure 1, see Code availability).

Solid tumors contain large and heterogeneous fractions of cancer cells for which it is challenging to build reference profiles. To our knowledge this work provides the first benchmark of deconvolution tools adapted to ATAC-Seq data in the context of solid tumor samples. We show that the EPIC-ATAC framework, in contrast to other existing tools, allows users to accurately predict the proportion of cells not included in the reference profiles (Figure 4 and Supplementary Figure 4). These uncharacterized cells can include cancer cells but also other non-malignant cells. Since the major cell types composing TMEs were included in our reference profiles, the proportion of uncharacterized cells approximates the proportion of the cancer cells in most cases.

The pseudobulk approach provides unique opportunities to design benchmarks with known cell-type proportions but also comes with some limitations. Indeed, pseudobulks are generated from single-cell data which are noisy and whose cell-type annotation is challenging in particular for closely related cell types. These limitations might lead to chromatin accessibility profiles that deviates from true bulk data and errors in the true cell-type proportions. For this reason, we anticipate that the newly generated benchmarking PBMCs dataset with ground truth cell proportions obtained by flow cytometry will nicely complement pseudobulk from scATAC-Seq data in future benchmarks of ATAC-Seq deconvolution. The qualitative evaluation of our method on true bulk ATAC-Seq samples from breast cancer patients and the observation of similar immune compositions in TN and ER+/HER2-samples as the ones identified in the original paper (Figure 6) further support the accuracy of EPIC-ATAC to deconvolve bulk ATAC-Seq data, without requiring additional scATAC-Seq data which are not always available for all cancer types.

Overall the evaluation of the EPIC-ATAC deconvolution resulted in an average absolute error of 7% across cell types. This number is consistent with previous observations in RNA-Seq data deconvolution (Racle et al. 2017). Considering this uncertainty, the quantification of low frequency populations remains challenging (Jin and Liu 2021). While the estimated proportions of these populations by EPIC-ATAC are low (e.g., dendritic cells), comparing such estimations across samples should be performed with care due to the uncertainty of the predictions.

Another limitation of cell-type deconvolution is often reached when closely related cell types are considered. In the reference-based methods used in this study, this limit was reached when considering T-cell subtypes in the reference profiles (Figure 5 and Supplementary figure 8). We thus recommend to use the EPIC-ATAC framework using the markers and reference profiles based on the major cell populations. We additionally provide the marker peaks of the T-cell subtypes which could be used to build cell-type specific chromatin accessibility signatures or perform “peak set enrichment analysis” similarly to gene set enrichment analysis (GSEA, (Subramanian et al. 2005)). Such application could be useful for the annotation of scATAC-Seq data, which often relies on matched RNA-Seq data and for which there is a lack of markers at the peak level (Jiang et al. 2023).

Another possible application of our marker peaks relies on their annotation (Figure 2G, Supplementary Tables 4-5), which could be used to expand the list of genes and CBPs associated to each cell type or subtype. For example, the neutrophils marker peaks were enriched for motifs of TFs such as SPI1 (Supplementary Table 4), which was not listed in the neutrophil genes in the databases used for annotation but has been reported in previous studies as involved in neutrophils development (Watt et al. 2021). The annotations related to the set of major cell types and T-cell subtypes are provided in Supplementary Tables 4-5 and 10-11. Finally, the annotation of marker peaks highlighted pathways involved in immune responses to tumoral environments (Figure 2G). Examples of these pathways are the toll-like receptor signaling pathway involved in pathogen-associated and recognition of damage-associated molecular patterns in diverse cell types including B and T cells (Geng et al. 2010; Javaid and Choi 2020), glucan metabolic processes which are known to be related to trained immunity which can lead to anti-tumor phenotype in neutrophils (Kalafati et al. 2020) or the Fc-receptor signaling observed in NK cells (Sanseviero 2019; Bonnema et al. 1994). These observations suggest that our marker peaks contain regulatory regions not only specific to cell types but also adapted to the biological context of solid tumors.

Conclusion

In this work, we identified biologically relevant cell-type specific chromatin accessibility markers and profiles for all major cancer-relevant cell types. We capitalized on these markers and profiles to predict cell-type proportions from bulk PBMC and solid tumor ATAC-Seq data (https://github.com/GfellerLab/EPIC-ATAC). Evaluated on diverse tissues, EPIC-ATAC shows reliable predictions of immune, stromal, vascular and cancer cell proportions. With the expected increase of ATAC-Seq studies in cancer, the EPIC-ATAC framework will enable researchers to deconvolve bulk ATAC-Seq data from tumor samples to support the analysis of regulatory processes underlying tumor development, and correlate the TME composition with clinical variables.

Materials and methods

Generation of an ATAC-Seq reference dataset of cancer relevant cell types

Pre-processing of the sorted ATAC-Seq datasets

We collected pure ATAC-Seq samples from 12 studies. The data include samples from (i) ten major immune, stromal and vascular cell types (B (Calderon et al. 2019; Corces et al. 2016; P. Zhang et al. 2022), CD4+ (Corces et al. 2016; Liu et al. 2020; P. Zhang et al. 2022; Mumbach et al. 2017; Giles et al. 2022), CD8+ (Calderon et al. 2019; Corces et al. 2016; Liu et al. 2020; P. Zhang et al. 2022; Giles et al. 2022), natural killer (NK) (Calderon et al. 2019; Corces et al. 2016), dendritic (DCs) cells (Calderon et al. 2019; Leylek et al. 2020; Liu et al. 2020), macrophages (Liu et al. 2020; P. Zhang et al. 2022), monocytes (Calderon et al. 2019; Corces et al. 2016; Leylek et al. 2020; P. Zhang et al. 2022; Trizzino et al. 2021) and neutrophils (Ram-Mohan et al. 2021; Perez et al. 2020) as well as fibroblasts (Ge et al. 2021; Liu et al. 2020) and endothelial (Liu et al. 2020; Xin et al. 2020) cells (See Figure 2A), and (ii) eight tissues from distinct organs (i.e bladder, breast, colon, liver, lung, ovary, pancreas and thyroid) from the ENCODE data (The ENCODE Project Consortium et al. 2020; Rozowsky et al. 2023). The list of the samples and their associated metadata (including cell types and accession number of the study of origin) is provided in Supplementary Table 1. To limit batch effects, the samples were reprocessed homogeneously from the raw data (fastq files) processing to the peak calling. For that purpose, raw fastq files were collected from GEO using the SRA toolkit and the PEPATAC framework (Smith et al. 2021) was used to process the raw fastq files based on the following tool: trimmomatic for adapter trimming, bowtie2 (with the PEPATAC default parameters) for reads pre-alignment on human repeats and human mitochondrial reference genome, bowtie2 (with the default PEPATAC parameters: --very-sensitive-X 2000) for alignment on the human genome (hg38), samtools (PEPATAC default parameters: -q 10) for duplicates removal and MACS2 (Y. Zhang et al. 2008) (PEPATAC default parameters: --shift-75--extsize 150 --nomodel --call-summits --nolambda --keep-dup all -p 0.01) for peak calling in each sample. After alignment, reads mapping on chromosome M were excluded. TSS enrichment scores were computed for each sample and used to filter out samples with low quality (criteria of exclusion: TSS score < 5) (See Supplementary Table 1 containing the TSS score of each sample). 789 samples (including 564 from our ten reference cell-types) had a TSS score > 5.

Generation of a consensus set of peaks

Peak calling was performed in each sample individually. Peaks were then iteratively collapsed to generate a set of reproducible peaks. For each cell type, peaks collapse was performed adapting the iterative overlap peak merging approach proposed in the PEPATAC framework. A first peaks collapse was performed at the level of each study of origin, i.e., if peaks identified in distinct samples overlapped (minimum overlap of 1bp between peaks), only the peak with the highest peak calling score was kept. Also, only peaks detected in at least half of the samples of each study were considered for the next step. If a study had only two samples, only peaks detected in both samples were considered. After this first selection, a second round of peaks collapse was performed at the cell-type level to limit batch effects in downstream analyses. For each cell type, only peaks detected in all the studies of origin were considered. The final list of peaks was then generated by merging each set of reproducible peaks. Peaks located on chromosome Y were excluded from the rest of the analyses. ATAC-Seq counts were retrieved for each sample and each peak using featureCounts (Liao, Smyth, and Shi 2014).

Identification of cell-type specific markers

Differential accessibility analysis

To identify cell-type specific markers, we split the samples collection in ten folds (created with the create_folds function from the R package splitTools (Mayer 2023)). For each fold, we performed pairwise differential accessibility analysis across the ten cell types considered in the reference samples as well as the ENCODE samples from diverse organs. The differential analysis was performed using limma ((Ritchie et al. 2015), version 3.56.2). Effective library sizes were computed using the method of trimmed mean of M-values (TMM) from the edgeR package in R ((Robinson, McCarthy, and Smyth 2010), version 3.42.4). Due to differences of library size across all samples collected, we used voom from the limma package (Law et al. 2014) to transform the data and model the mean-variance relationship. Finally, a linear model was fitted to the data to assess the differential accessibility of each peak across each pair of cell types. To identify our marker peaks, all peaks with log2 fold change higher than 0.2 were selected and ranked by their maximum adjusted p-value across all pairwise comparisons. The top 200 features (with the lowest maximum adjusted p-value) were considered as cell-type specific marker peaks. The marker peaks identified in at least three folds were considered in the final list of marker peaks.

Marker peaks filtering

Modules of open chromatin regions accessible in all (universal modules) or in specific human tissues have been identified in the study Zhang et al. (K. Zhang et al. 2021). These regions were used to refine the set of marker peaks and exclude peaks with residual accessibility in other cell types than those considered for deconvolution. More precisely, for immune, endothelial and fibroblasts specific peaks, we filtered out the peaks overlapping the universal modules as well as the tissue specific modules except the immune (modules 8 to 25), endothelial (modules 26 to 35) and stromal related modules (modules 41 to 49 and 139-150) respectively. As a second filtering step, we retained markers exhibiting the highest correlation patterns in tumor bulk samples from different cancer types, i.e., The Cancer Genome Atlas (TCGA) samples (Corces et al. 2018). We used the Cancer Genomics Cloud (CGC) (Lau et al. 2017) to retrieve the ATAC-Seq counts for each marker peaks in each TCGA sample (using featureCounts). For each set of cell-type specific peaks, we identified the most correlated peaks using the findCorrelation function of the caret R package ((Kuhn 2008), version 6.0-94) with a correlation cutoff value corresponding to the 90th percentile of pairwise Pearson correlation values.

Evaluation of the study of origin batch effect

To identify potential batch effect issues, we run principal component analysis (PCA) based on the cell-type specific peaks after normalizing ATAC-Seq counts using full quantile normalization (FQ-FQ) implemented in the EDASeq R package (Risso et al. 2011) to correct for depth and GC biases. These data were used to visualize the data in two-dimensional space running Uniform Manifold Approximation (UMAP) based on the PBMC and TME markers (Figure 2B). We also run PCA and used the ten first principal components to evaluate distances between samples and compute silhouette coefficients based on the cell type and study of origin classifications.

Building the reference profiles

It has been previously demonstrated in the context of RNA-Seq based deconvolution approaches (Racle et al. 2017; Sturm et al. 2019) that the transcripts per million (TPM) transformation is appropriate to estimate cell fractions from bulk mixtures. We thus normalized the ATAC-Seq counts of the reference samples using a TPM-like transformation, i.e., dividing counts by peak length, correcting samples counts for depth and rescaling counts so that the counts of each sample sum to 106. We then computed for each peak the median of the TPM-like counts across all samples from each cell type to build the reference profiles of the ten cell types considered in the EPIC-ATAC framework (Figure 2C). In the EPIC algorithm, weights reflecting the variability of each feature of the reference profile can be considered in the constrained least square optimization. We thus also computed the inter-quartile range of the TPM-like counts for each feature in each cell type. Two ATAC-Seq reference profiles are available in the EPIC-ATAC framework: (i) a reference profile containing profiles for B cells, CD4+ T cells, CD8+ T cells, NK, monocytes, dendritic cells and neutrophils to deconvolve PBMC samples, and (ii) a reference profile containing profiles for B cells, CD4+ T cells, CD8+ T cells, NK, dendritic cells, macrophages, neutrophils, fibroblasts and endothelial cells to deconvolve tumor samples. The reference profiles are available in the EPICATAC R package and the reference profiles restricted to our cell-type specific marker peaks are available in the Supplementary Tables 2 and 3.

Assessing the reproducibility of the marker peaks signal in independent samples

We evaluated the chromatin accessibility level of the marker peaks in samples that were not included in the peak calling step. Firstly, we considered samples from two independent studies (Ucar et al. 2017; Carvalho et al. 2021) providing pure ATAC-Seq data for five immune cell types (i.e., B, CD4+ T cells, CD8+ T cells, Monocytes, Macrophages) (Figure 2D). To consider the other cell types, samples that were excluded from the reference dataset due to a low TSS enrichment score were also considered in this validation dataset (Supplementary Table 1). Secondly, we collected the data from a single-cell atlas chromatin accessibility from human tissues and considered the cell types included in our reference data (K. Zhang et al. 2021) (Figure 2E). We used the cell-type annotations provided in the original study (GEO accession number: GSE184462). The Signac R package ((Stuart et al. 2021), 1.9.0) was used to extract fragments counts for each cell and each marker peak and the ATAC-Seq signal of each marker peak was averaged across all cells of each cell type.

Annotation of the marker peaks

The cell-type specific markers were annotated using ChIPseeker R package ((Yu, Wang, and He 2015), version 1.34.1) and the annotation from TxDb.Hsapiens.UCSC.hg38.knownGene in R to identify the regions in which the marker peaks are (i.e., promoter, intronic regions, etc.) and ChipEnrich to associate each peak to the nearest gene TSS (Welch et al. 2014). The nearest genes identified were then compared to cell-type marker genes listed in the PanglaoDB (Franzén, Gan, and Björkegren 2019) and CellMarker databases (Hu et al. 2023). PanglaoDB provides an online interface to explore a large collection on single-cell RNA-Seq data as well as a community-curated list of cell-type marker genes. CellMarker is a database providing a large set of curated cell-type markers for more than 400 cell types in human tissues retrieved from a large collection of single-cell studies and flow cytometry, immunostaining or experimental studies. ChipEnrich was also used to perform gene set enrichment and identify for each set of cell-type specific peaks potential biological pathways regulated by the marker peaks. The enrichment analysis was performed using the chipenrich function (genesets = “GOBP”, locusdef = “nearest_tss”) from the chipenrich R package (v2.22.0). Chromatin accessibility peaks can also be annotated for chromatin binding proteins (CBPs) such as transcription factors (TFs), whose potential binding in the peak region is reported in databases. In our study we chose the JASPAR2022 (Castro-Mondragon et al. 2022) database and the ReMap database (Hammal et al. 2022).

Using the JASPAR2022 database, we assessed, for each cell type, whether the cell-type specific marker peaks were enriched in specific TFs motifs using two TFs enrichment analysis frameworks: Signac (Stuart et al. 2021) and MonaLisa (Machlab et al. 2022). For the MonaLisa analysis, the cell-types specific markers peaks were categorized in bins of sequences, one bin per cell type (use of the calcBinnedMotifEnrR function). To test for an enrichment of motifs, the sequences of each bin were compared to a set of background peaks with similar average size and GC composition obtained by randomly sampling regions in all the peaks identified from the reference dataset. The enrichment test was based on a binomial test. For the Signac analysis, we used the FindMotif function to identify over-represented TF motifs in each set of cell-type specific marker peaks (query). This function used a hypergeometric test to compare the number of query peaks containing the motif with the total number of peaks containing the motif in the background regions (matched to have similar GC content, region length and dinucleotide frequencies as the query regions), corresponding in our case to the peaks called in the reference dataset.

The ReMap database associates chromatin binding proteins (CBPs), including TFs, transcriptional coactivators and chromatin-remodeling factors, to their DNA binding regions based on DNA-binding experiments such as chromatin immunoprecipitation followed by sequencing (ChIP-seq). For each association of a CBP to its binding region, the cell type in which the binding has been observed is reported in the ReMap database (biotype). We used the ReMapEnrich R package (version 0.99) to test if the cell-type specific marker peaks are significantly enriched in CBPs-binding regions listed in the Remap 2022 catalog. We considered the non-redundant peaks catalog from Remap 2022, containing non-redundant binding regions for each CBP in each biotype. Similarly to the previously mentioned enrichment methods, we chose the consensus peaks called in the reference samples as universe for the enrichment test. Note that, for each cell type, an enrichment was retained only if the biotype in which the CBP-regions were identified matched the correct cell-type.

Running EPIC-ATAC on bulk ATAC-Seq data

The samples used to generate the reference profiles were aligned using the hg38 reference genome. To assure the compatibility of any input bulk ATAC-Seq dataset with the EPIC-ATAC marker peaks and reference profiles, we provide an option to lift over hg19 datasets to hg38 (use of the liftOver R package). Subsequently, the features of the input bulk matrix are matched to our reference profiles features. To match both sets of features, we determine for each peak of the input bulk matrix the distance to the nearest peak in the reference profiles peaks. Overlapping regions are retained and the feature IDs are matched to their associated nearest peaks. If multiple features are matched to the same reference peak, the counts are summed. In RNA-Seq based deconvolution, EPIC uses an estimation of the amount of mRNA in each reference cell type to derive cell proportions. For the ATAC-Seq based deconvolution these values were set to 1 to give similar weights to all cell-types quantifications.

Datasets used for the evaluation of ATAC-Seq deconvolution

PBMCs ATAC-Seq data from healthy donors

Peripheral blood mononuclear cell (PBMC) isolation

Venous blood from five healthy donors was collected at the local blood transfusion center of Geneva in Switzerland, under the approval of the Geneva University Hospital’s Institute Review Board, upon written informed consent and in accordance with the Declaration of Helsinki. PBMCs were freshly isolated by Lymphoprep (Promega) centrifugation (1800 rpm, 20 minutes, without break, room temperature). Red blood cell lysis was performed using red blood lysis buffer (Qiagen) and platelets were removed by centrifugation (1000 rpm, 10 minutes without break, room temperature). Cells were counted and immediately used.

Flow cytometry

Immune cell populations were identified using multiparameter flow cytometry and the following antibodies: FITC anti-human CD45RA (HI100, Biolegend), PerCP-Cyanine5.5 anti-human CD19 (H1B19, Biolegend), PE anti-human CD3 (SK7, Biolegend), PE-Dazzle anti-human CD14 (M0P9, BD Biosciences), PE-Cyanine7 anti-human CD56 (HCD56, Biolegend), APC anti-human CD4 (RPA-T4, Biolgend), APC-Cyanine7 anti-human CCR7 (G043H7, Biolegend), Brilliant Violet 421 anti-human CD8 (RPA-T8, Biolegend), Brilliant Violet 510 anti-human CD25 (BC96, Biolegend), Brilliant Violet 711 anti-human CD16 (3G8, Biolegend), Brilliant Violet 786 anti-human CD127 (A019D5, Biolegend), Ultra-Brilliant Violet anti-human CD45 (HI30, BD Biosciences), FITC anti-human Celc9a (8F9, Miltenyi), PE anti-human XCR1 (S15046E, Biolegend), PE-Dazzle anti-human BDCA-2 (201A, Biolegend), APC anti-human BDCA-3 (AD5-14H12, Miltenyi), Brilliant Violet 421 anti-human CD3 (UCHT1, Biolegend), Brilliant Violet 421 anti-human CD14 (M5E2, BD Pharmingen), Brilliant Violet 421 anti-human CD19 (SJ25C1, Biolegend), Brilliant Violet 510 anti-human BDCA-1 (L161, Biolegend), Brilliant Violet 650 anti-human CD11c (3.9, Biolegend), Brilliant Violet 711 anti-human CD11c (N418, Biolegend) and Brilliant Violet 711 anti-human HLA-DR (L243, Biolegend). Dead cells were excluded using the Zombie UV™ Fixable Viability Kit (Biolegend). Intracellular staining was performed after fixation and permeabilization of the cells with the FoxP3 Transcription Factor Staining Buffer Set (00-5523-00, Invitrogen) using Alexa 700 anti-human FoxP3 antibody (259D/C7, BD Biosciences). Data were acquired on LSRFortessa flow cytometer and analysed using FlowJo software (v10.7.1).

Cell preparation for ATAC-Sequencing

50000 CD45+ cells were sorted from total PBMCs using anti-human Ultra-Brilliant Violet (BUV395) CD45 (HI30, BD Biosciences) with a FACSAria II (Becton Dickinson) and were collected in PBS with 10% Foetal Bovine Serum (FBS). Cell pellets were resuspended in cold lysis buffer (10mM Tris-Cl pH 7.4, 10mM NaCl, 3mM MgCl2, 0,1% NP40 and water) and immediately centrifuged at 600g for 30min at 4°C. Transposition reaction was performed using the Illumina Tagment DNA Enzyme and Buffer kit (20034210, Illumina) and transposed DNA was eluted using the MinElute PCR Purification Kit (Qiagen). Libraries were generated by PCR amplification using indexing primers and NEBNext High-Fidelity Master Mix (New England BioLabs) and were purified using AMPure XP beads (A63880, Beckman Coulter). Libraries were quantified by a fluorometric method (QubIT, Life Technologies) and their quality assessed on a Fragment Analyzer (Agilent Technologies). Sequencing was performed as a paired end 50 cycles run on an Illumina NovaSeq 6000 (v1.5 reagents) at the Genomic Technologies Facility (GTF) in Lausanne, Switzerland. Raw sequencing data were demultiplexed using the bcl2fastq2 Conversion Software (version 2.20, Illumina).

Data processing

The same steps as for the processing of the reference ATAC-Seq samples were followed. (See Pre-processing of the ATAC-Seq datasets).

ATAC-Seq pseudobulk data from PBMCs and cancer samples

To evaluate the accuracy of our ATAC-Seq deconvolution framework, we generated pseudo-bulk datasets from 5 single-cell datasets:

  • PBMC pseudobulk dataset: combination of three single-cell datasets for PBMCs.

    • Dataset 1 corresponds to a scATAC-Seq dataset obtained from Satpathy et al. (Satpathy et al. 2019) (GEO accession number: GSE129785). This dataset contains FACS-sorted populations of PBMCs. Since the cells of some cell types came from a unique donor, all the cells of this dataset were aggregated to form one pseudobulk. Ground truth cell fractions were obtained by dividing the number of cells in each cell type by the total number of cells.

    • Dataset 2 (included in the PBMC pseudobulk dataset) was retrieved from Granja et al. (Granja et al. 2019) (GEO accession number GSE139369). B cells, monocytes, dendritic, CD8+, CD4+ T, NK cells, neutrophils from healthy donors were considered. The neutrophil cells came from a single donor. As for dataset 1, we thus aggregated all the cells to generate one pseudobulk. Ground truth cell fractions were obtained by dividing the number of cells in each cell type by the total number of cells.

    • Dataset 3 (included in the PBMC pseudobulk dataset) corresponds to the 10X multiome dataset of PBMC cells (10x Genomics 2021). Since these data come from one donor, one pseudobulk sample was generated for this dataset. The pseudobulk was generated by averaging the ATAC-Seq signal from all cells from the following cell types: B cells, CD4+ T cells, CD8+ T cells, NK cells, Dendritic cells and monocytes.

  • Basal cell carcinoma dataset: obtained from the study of Satpathy et al. (Satpathy et al. 2019). This dataset is a scATAC-Seq dataset composed of 13 basal cell carcinoma samples composed of immune (B cells, plasma cells, CD4+ T cells, CD8+ T cells, NK cells, myeloid cells), stromal (endothelial and fibroblasts) and cancer cells. Plasma cells and cancer cells were both considered as uncharacterized cells (i.e., cell types not included in the reference profiles). Cell annotations were retrieved from the original study.

  • Gynecological cancer dataset: obtained from the study of Regner et al. (Regner et al. 2021) (GEO accession number GSE173682). In this study, the authors performed scATAC-Seq on 11 gynecological cancer samples from two tumor sites (i.e endometrium and ovary) and composed of immune (B cells, NK and T cells grouped under the same cell-type annotation, macrophages, mast cells), stromal (fibroblast, endothelial, smooth muscle) and cancer cells. Mast cells, smooth muscle and cancer cells were considered as uncharacterized cells. Cell annotations were retrieved from the original study.

For Basal cell carcinoma and Gynecological cancer datasets, one pseudobulk per sample was generated and ground truth cell fractions were obtained for each sample by dividing the number of cells in each cell type by the total number of cells in the sample.

For each dataset, raw fragments files were downloaded from the respective GEO accession numbers and data were preprocessed using ArchR ((Granja et al. 2021), ArchR R package 1.0.2). Cells with TSS score below four were removed. Doublets removal was performed using the doubletsRemoval function from ArchR. To match as much as possible real bulk ATAC-seq data processing, peak calling was not performed on each cell type or cell cluster as usually done in scATAC-Seq studies but using all cells for each dataset from the PBMC pseudobulk data or grouping cells by sample for the Basal cell carcinoma and Gynecological cancer datasets. Peak calling was performed using MACS2 within the ArchR framework. Fragments counts were extracted using ArchR for each peak called to generate single-cell peak counts matrices. These matrices were normalized using a TPM-like transformation, i.e., dividing counts by peak length and correcting samples counts for depth. Finally, for each peak, the average of the normalized counts was computed across all the cells for each dataset from the PBMC pseudobulk data and across all the cells of each sample for the Basal cell carcinoma and Gynecological cancer datasets. Averaged data were then rescaled so that the sum of counts of each sample sum to 106.

Bulk ATAC-Seq data from a breast cancer cohort

Bulk ATAC-Seq samples from a breast cancer cohort was obtained from Kumegawa et al. (Kumegawa et al. 2023). These data include 42 breast cancer samples which can be classified based on two features: (i) the breast cancer subtype ER+/HER2- or triple negative, and (ii) the molecular classification provided by the original study (CA-A, CA-B and CA-C). The ATAC-Seq raw counts and the samples metadata were retrieved from figshare (Kumegawa 2023). As for the previously mentioned datasets, raw counts were normalized using the TPM-like transformation prior to bulk deconvolution.

Benchmarking of the EPIC-ATAC framework against other existing deconvolution tools

The performances of the EPIC-ATAC framework were benchmarked against the following deconvolution tools:

  • quanTIseq (Finotello et al. 2019) is a deconvolution tool using constrained least square regression to deconvolve RNA-Seq bulk samples. No reference profiles are available in this framework to perform ATAC-Seq deconvolution and quanTIseq does not provide the option to automatically build reference profiles from pure bulk samples. quanTIseq was thus run using the reference profiles derived in this work for the EPIC-ATAC framework and the quanTIseq function from the quantiseqr R package (parameters: scaling set to 1 for all cell types and method set to “lsei”).

  • DeconPeaker (H. Li et al. 2020) relies on SIMPLS, a variant of partial least square regression to perform bulk RNA-Seq and bulk ATAC-Seq deconvolution. ATAC-Seq reference profiles are available in this deconvolution framework however not all cell types considered in the EPIC-ATAC framework are included in the DeconPeaker reference profiles. This tool was thus run using different reference profiles: (i) the reference profiles derived in this work for the EPIC-ATAC framework (corresponds to “DeconPeaker” or “DeconPeaker_ourmarkers” in our analyses), and (ii) reference profiles automatically generated by DeconPeaker from the sorted reference samples collected in this work (corresponds to “DeconPeaker_cust.” in our analyses). The results of DeconPeaker obtained using its original markers and profiles are also provided for the cell types in common with the cell types considered in this work in Supplementary Figures 3 and 5. Deconvolution was run using the deconvolution module deconPeaker (using findctsps with the following parameter: --lib-strategy=ATAC-Seq). DeconPeaker outputs cell-type proportions relative to the total amount of cells from the reference cell types.

  • CIBERSORTx (Newman et al. 2019) is a deconvolution algorithm based on linear support vector regression. CIBERSORTx does not provide ATAC-Seq reference profiles, however it is possible to automatically generate new profiles from a set of pure bulk samples. This tool was thus run using different reference profiles: i) the reference profiles derived in this work for the EPIC-ATAC framework (corresponds to “CIBERSORTx” or “CIBERSORTx_ourMarkers” in our analyses), and ii) reference profiles automatically generated by CIBERSORTx from the sorted reference samples collected in this work (corresponds to “CIBERSORTx_cust.” in our analyses). To run CIBERSORTx, we used the docker container provided by the authors of CIBERSORTx on their website. The algorithm was run using the default options (i.e --absolute FALSE, --rmbatchBmode FALSE and –rmbatchSmode FALSE), which results in cell-type proportions relative to the total amount of cells from the reference cell types.

  • ABIS (Monaco et al. 2019) uses robust linear modeling to estimate cell-type proportions in bulk RNA-Seq samples. No ATAC-Seq reference profiles are available in the deconvolution framework. ABIS was run using the EPIC-ATAC reference profiles by using the rlm function from the MASS R package (as performed in the deconvolute_abis function from the immunedeconv R package (Sturm et al. 2019) was used to quantify each cell type from the reference profiles. The cell-types quantifications returned by this approach are in arbitrary units. To compare the estimations and the true cell proportions, we scaled the estimations of each sample between 0 and 1 to obtained relative proportions.

  • MCPcounter (Becht et al. 2016): MCPcounter returns scores instead of cell type proportions.

The scores were obtained using the appendSignatures function from the MCPcounter R package by providing the list of marker peaks specific to each cell type. The cell-type scores are not comparable between cell type, MCPcounter was thus included only in the evaluation of the performances in each cell type separately.

For all the tools, TPM-like data were used as input bulk samples for the deconvolution. Since CIBERSORTx, ABIS and DeconPeaker do not predict proportions of uncharacterized cells, we performed two benchmarking analyses: (i) including all cell types and (ii) excluding the cell types that are absent from the reference profiles (uncharacterized cells) and rescaling the estimated and true proportions of the immune cells, endothelial cells and fibroblasts so that their sum equals 1.

Comparing deconvolution based on RNA-Seq, gene activity or peaks features

100 pseudobulks were generated from the 10X PBMC multiome dataset (10x Genomics 2021) based on 3000 cells for each pseudobulk. Cell fractions were defined using the rdirichlet function from the gtools R package. Three sets of features were extracted from the data, i.e., gene expression features extracted from the RNA-Seq layer, ATAC-Seq peaks and gene activity derived from the ATAC-Seq layer. The same cells sampling was considered for each modality.

Gene activity features were extracted from the single-cell data using ArchR (1.0.2), which considers distal elements and adjusts for large differences in gene size in the gene activity score calculation. Gene activity pseudobulks were built by averaging the gene activity scores across all cells belonging to the pseudobulk. For ATAC-Seq pseudobulk, peaks called using ArchR on all cells form the 10X dataset were considered (see the method section “ATAC-Seq pseudobulk data from PBMCs and cancer samples”) and counts were averaged across all cells of each pseudobulk. For RNA-Seq pseudobulks, counts were also averaged across all cells of each pseudobulk. All aggregated data were depth normalized across each features to 106. Cell-type deconvolution was performed on each pseudobulk using EPIC-ATAC on the peak matrix using our ATAC-Seq marker peaks and reference profiles. The RNA-Seq and gene activity pseudobulks were deconvolved with EPIC.

Code availability

The code to download and preprocess publicly available ATAC-Seq samples as well as the code used to identify our cell-type specific marker peaks and generate the reference profiles is available on GitHub (https://github.com/GfellerLab/EPIC-ATAC_Manuscript). A README file is provided on the GitHub repository with more details on how to use the code.

The code to perform ATAC-Seq deconvolution using the EPIC-ATAC framework is available as an R package called EPICATAC and is available on GitHub (https://github.com/GfellerLab/EPIC-ATAC).

Data availability

The newly generated ATAC-Seq data have been deposited on Zenodo (doi: 10.5281/zenodo.8431792). The other data related to this work are available in the supplementary tables and on the Zenodo deposit (doi: 10.5281/zenodo.8431792).

Competing interests

The authors declare that they have no competing interests.

Acknowledgements

We thank the Lausanne Genomic Technologies Facility, University of Lausanne, Switzerland (https://www.unil.ch/gtf/en/home.html) for the sequencing of the PBMC samples as well as Yan Liu, Dana Moreno and Matei Teleman for testing the EPICATAC R package. Some of the illustrations were created with BioRender.com.

Authors contributions

Conceptualization: AAGG, JR, DG; Data curation: AAGG; Software: AAGG, JR, DG; Experiments: MF, CJ; Visualization: AAGG, JR, DG; Methodology: AAGG, JR, DG; Writing—original draft: AAGG, DG; Writing—review and editing: all authors.

List of abbreviations

  • ATAC: Assay for Transposase-Accessible chromatin

  • CBP(s): chromatin binding protein(s)

  • ChIP-seq: chromatin immunoprecipitation followed by sequencing

  • DC: dendritic cells

  • NK: natural killer cells

  • PCA: principal component analysis

  • RMSE: root mean squared error

  • TCGA: The Cancer Genome Atlas

  • TF(s): transcription factor(s)

  • TSS: transcription start site

  • UMAP: Uniform Manifold Approximation