Wednesday, June 15, 2016

Spring Work

Quantitative Modeling of Signaling Interactions in the Breast Tumor Microenvironment
Specific Aims

Aim 1: Identify sets of stromal-epithelial interactions that suggest candidate hypotheses for the causal mechanisms through which stroma expression levels can affect tumor functions.

Aim 2: Identify modules in the stromal-epithelial signaling network that are conserved between mouse and human, supporting the assignment of degrees of confidence to predictions regarding human disease made based on mouse models.

Aim 3: Integrate candidate causal mechanisms through which stroma expression levels can affect tumor functions into one or more stochastic models and use these to develop statistical measures that can distinguish or reject these alternative hypotheses based on experimental data that can be feasibly obtained.

Data I was given

I was given mouse mammary transcriptome data. Breast tissue had been collected from 8-week-old virgin FVB/N mice under four conditions: either wild-type ErbB2 or mutant ErbB2, combined with either normal PTEN or conditional knockout of PTEN in the fibroblasts only. This tissue had been separated into four cell types: epithelial cells, fibroblasts, macrophages, and endothelial cells. Microarrays were then used to probe the expression levels of 17,548 genes. Where a gene had several probes, the probe with the maximum value had been chosen as the representative entry in the data that I have. If different probes reflected different isoforms of the same gene, that information has been lost in the file I have.
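The max-probe rule described above can be sketched as follows. This is only an illustration of the rule, not the code that produced the file I was given; the function name and the expression values are made up for the example.

```python
from collections import defaultdict

def collapse_probes_by_max(probe_rows):
    """Collapse (gene, expression) probe rows to one value per gene.

    The probe with the maximum value represents the gene, so any
    per-isoform differences between probes are discarded.
    """
    values_by_gene = defaultdict(list)
    for gene, expression in probe_rows:
        values_by_gene[gene].append(expression)
    return {gene: max(values) for gene, values in values_by_gene.items()}

# Hypothetical probe-level values, for illustration only.
probe_rows = [("Pten", 7.1), ("Pten", 8.4), ("Erbb2", 9.0)]
collapsed = collapse_probes_by_max(probe_rows)
```

After this step only one number per gene remains, which is why isoform-level information cannot be recovered from the collapsed file.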

I was given mouse mammary fibroblast secretome data. Mouse mammary fibroblasts, with either normal PTEN or PTEN knockout, were cultured. The medium was collected and proteomics was run to determine which proteins were present. I was given the names of 67 genes whose protein products were secreted in both the normal and PTEN knockout conditions, and 54 other genes whose protein products were detected only in the PTEN knockout condition.

These are the two datasets to be used for Aim 1.

I was given gene expression data from tumor epithelial cells and from matched adjacent stromal cells from 123 human breast cancer patients. In this case, where a gene had multiple microarray probes, the expression level was obtained by averaging them (as I interpret the note: “Technical replicates averaged. There are multiple probes for some genes.”).

I was given exome data from 21 of the human breast cancer patients, of which 12 had matched tumor and normal samples and 9 had only tumor samples.

These are the two datasets to be used in conjunction with the previous two, for Aim 2.  The results of Aims 1 and 2 are to be integrated for Aim 3.

Most of the transcripts in the mouse transcriptome data have protein products and some are noncoding RNAs. The same is true of the patient transcriptome data.

What I have done

I have mapped the gene names used in the mouse transcriptome data to currently used gene names. I resolved most of these by querying MGI. I resolved some others by following up on Ensembl transcript ids (including deprecated ones), using the chromosomal location in the data file to choose between two possible current symbols, following the history of changed Entrez gene ids, and looking on Vesiclepedia. I mapped most of the genes to Ensembl gene ids and UniProt accessions. The gene Hist2h2aa1 appeared twice in the transcriptome data files, on the + and – strands of chr3. I edited my copy of the data files to reflect that Hist2h2aa1 is on the + strand and Hist2h2aa2 is on the – strand.
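The location-based disambiguation can be sketched as below, using the Hist2h2aa1 case as the example. The function name and the shape of the lookup table are assumptions made for illustration; in practice the candidates came from the sources listed above, not from a hand-written table.

```python
def resolve_symbol(old_symbol, chrom, strand, candidates_by_symbol):
    """Pick the current symbol for an old one, using location to break ties.

    candidates_by_symbol maps an old symbol to a list of dicts with
    keys "symbol", "chrom", and "strand" (a hypothetical layout).
    Returns None when the symbol cannot be resolved unambiguously.
    """
    candidates = candidates_by_symbol.get(old_symbol, [])
    if len(candidates) == 1:
        return candidates[0]["symbol"]
    matches = [
        c["symbol"]
        for c in candidates
        if c["chrom"] == chrom and c["strand"] == strand
    ]
    return matches[0] if len(matches) == 1 else None

# The duplicated Hist2h2aa1 entry, resolved by strand on chr3.
candidates_by_symbol = {
    "Hist2h2aa1": [
        {"symbol": "Hist2h2aa1", "chrom": "chr3", "strand": "+"},
        {"symbol": "Hist2h2aa2", "chrom": "chr3", "strand": "-"},
    ]
}
plus_strand = resolve_symbol("Hist2h2aa1", "chr3", "+", candidates_by_symbol)
minus_strand = resolve_symbol("Hist2h2aa1", "chr3", "-", candidates_by_symbol)
```

Entries that come back as None are the ones that still need manual follow-up.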

I need to similarly update the nomenclature for the human genes in the McGill transcriptome data.

I extracted the human orthologs of the mouse genes from the Ensembl ortholog files, as Ensembl protein ids for proteins and Ensembl transcript ids for genes.

I ran the first step of the exome analysis pipeline I was given, FastQC, on the exome data. I looked at the FastQC results (although this did not seem to be part of the protocol) and found that unidentified adapter sequences had been left in the data. Upon reviewing the literature I chose PEAT (Paired End Adapter Trimming) for trimming adapters, as it requires no a priori adapter sequence. FastQC also indicated a quality dropoff at the ends of the reads. I used Sickle to trim the low-quality bases.
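To make the quality-trimming step concrete, here is a simplified sliding-window trimmer for the 3' end of a read. It is written in the spirit of window-based trimmers and is not a reimplementation of Sickle; the function names, the default threshold of 20, the window of 4, and the assumption of Phred+33 encoding are all choices made for this example.

```python
def phred33_scores(quality_string):
    """Decode a Phred+33 FASTQ quality string into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]

def trim_low_quality_tail(sequence, quality_string, threshold=20, window=4):
    """Cut a read where the windowed mean quality first drops below threshold.

    Reads shorter than the window are returned unchanged in this sketch.
    """
    scores = phred33_scores(quality_string)
    if len(scores) < window:
        return sequence, quality_string
    for start in range(len(scores) - window + 1):
        window_mean = sum(scores[start:start + window]) / window
        if window_mean < threshold:
            # Keep any good bases at the front of the failing window.
            cut = start
            while cut < len(scores) and scores[cut] >= threshold:
                cut += 1
            return sequence[:cut], quality_string[:cut]
    return sequence, quality_string

# 'I' encodes quality 40 and '#' encodes quality 2 under Phred+33.
trimmed_seq, trimmed_qual = trim_low_quality_tail("ACGTACGTAC", "IIIIII####")
```

For paired-end data, both mates have to be trimmed and any pair whose mate becomes too short has to be handled consistently, which this sketch does not attempt.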

I looked over the steps in the exome analysis pipeline I was given and noticed that some of them seemed to use outdated methods. For instance, bwa aln was included in the pipeline, whereas bwa mem has been recommended for read lengths longer than 70 base pairs for several years. Furthermore, disk space was at a premium on the OSC cluster. Modern next-generation sequencing pipelines pipe intermediate results in memory from one process to the next for several steps, which avoids writing intermediate files to disk. I switched to the Blue Collar Bioinformatics suite (bcbio-nextgen), developed at the Harvard School of Public Health and now widely used for many bioinformatics tasks, and began running it.

This suite includes a standard pipeline for variant calling of matched tumor-normal samples. As given, this requires running one job for each sample that pools all lanes of the tumor sequence and all lanes of the normal sequence. This seems to require tinkering with the resource allocation when submitting the job to OSC, so that the job is not killed in the middle of alignment, yet the allocation is not so large that the scheduler never schedules it. An alternative is to downsample each tumor and normal exome sample. I have asked Xing Tang for help with all this when she has time. The Blue Collar Bioinformatics blog also describes a suggested analysis for tumor-only samples.
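The downsampling alternative can be sketched as follows. This is a toy in-memory version, assuming the two mate files have already been parsed into equal-length lists of records in the same order; it is not how the Blue Collar Bioinformatics suite downsamples, and the function name and record layout are hypothetical.

```python
import random

def downsample_read_pairs(mate1_records, mate2_records, fraction, seed=0):
    """Keep roughly `fraction` of read pairs, keeping the mates in sync.

    One random draw is made per pair, so a read is never kept without
    its mate. A fixed seed makes the subsample reproducible.
    """
    if len(mate1_records) != len(mate2_records):
        raise ValueError("mate files must contain the same number of reads")
    rng = random.Random(seed)
    kept1, kept2 = [], []
    for record1, record2 in zip(mate1_records, mate2_records):
        if rng.random() < fraction:
            kept1.append(record1)
            kept2.append(record2)
    return kept1, kept2

# Hypothetical read names standing in for full FASTQ records.
mate1 = [f"read{i}/1" for i in range(1000)]
mate2 = [f"read{i}/2" for i in range(1000)]
sub1, sub2 = downsample_read_pairs(mate1, mate2, fraction=0.25, seed=42)
```

The tumor and normal samples would each be downsampled this way before alignment, trading sequencing depth for a job that fits within a modest resource request.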

Tuncbag et al. developed a method, “Simultaneous Reconstruction of Signaling Pathways Using Prize-Collecting Steiner Forests”, that, given a set of proteins thought to be involved in a signaling process within a cell, uses biological network data to hypothesize what other proteins might be involved in this process. They use a message-passing algorithm from Bayesian networks. I developed a novel plate model (plate models are used in probabilistic graphical models) to extend this from cell-autonomous processes to intercellular processes, so that the reconstructed network can span cells of different types (e.g., epithelial cells and fibroblasts) as well as the extracellular space.
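To make the prize-collecting idea concrete, the sketch below scores a candidate forest under the form of the objective as I understand it: a penalty (scaled by beta) for prized proteins left out, plus the cost of the edges used, plus a charge (omega) per tree. The exact formulation should be checked against the paper. The function name, argument layout, and toy numbers are assumptions, and this only evaluates a given forest; it does not do the optimization that msgsteiner performs.

```python
def forest_objective(prizes, edge_costs, forest_nodes, forest_edges,
                     n_trees, beta=1.0, omega=1.0):
    """Score a candidate forest; lower is better.

    prizes: {node: prize}, edge_costs: {(u, v): cost},
    forest_nodes / forest_edges: what the candidate solution includes,
    n_trees: number of separate trees in the forest.
    """
    missed_prizes = sum(p for node, p in prizes.items()
                        if node not in forest_nodes)
    edges_paid = sum(edge_costs[edge] for edge in forest_edges)
    return beta * missed_prizes + edges_paid + omega * n_trees

# Toy network with made-up prizes and costs.
prizes = {"A": 5.0, "B": 3.0, "C": 1.0}
edge_costs = {("A", "B"): 2.0, ("B", "C"): 4.0}
score = forest_objective(prizes, edge_costs,
                         forest_nodes={"A", "B"},
                         forest_edges=[("A", "B")],
                         n_trees=1, beta=2.0, omega=0.5)
```

In this toy case, leaving out C costs less than paying for the expensive edge to reach it, which is the trade-off that lets the method drop weakly supported proteins.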

I have downloaded the msgsteiner and OmicsIntegrator software from the Tuncbag et al. paper. I am modifying it to work intercellularly using my plate model. It comes with generic protein-protein interaction data. I will also incorporate the ligand-receptor database used by Fuhai Li in his CCCExplorer, and the dbPTM database of post-translational modifications of proteins, such as phosphorylation. Their method uses epigenome data to constrain the possible transcription factors that might be hypothesized. I have found epigenome data from mouse mammary luminal cells of virgin FVB/N mice, and mouse dermal fibroblast cells of BALB/c mice.
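One naive way to combine intracellular interactions with ligand-receptor pairs across cell types is sketched below. It is only meant to show what an intercellular edge list could look like; it is not the plate model and not OmicsIntegrator's input format. The function name is hypothetical, and the two interactions are illustrative stand-ins rather than entries taken from the databases named above.

```python
def build_intercellular_edges(ppi_edges, ligand_receptor_pairs, cell_types):
    """Build edges over (cell_type, protein) nodes.

    Each cell type gets its own copy of the intracellular network, and
    every ligand-receptor pair links a sender cell type to a receiver
    cell type (sender == receiver covers autocrine signaling).
    """
    edges = set()
    for cell in cell_types:
        for protein_a, protein_b in ppi_edges:
            edges.add(((cell, protein_a), (cell, protein_b)))
    for sender in cell_types:
        for receiver in cell_types:
            for ligand, receptor in ligand_receptor_pairs:
                edges.add(((sender, ligand), (receiver, receptor)))
    return edges

# Illustrative inputs only.
ppi_edges = [("Egfr", "Grb2")]
ligand_receptor_pairs = [("Egf", "Egfr")]
cell_types = ["epithelial", "fibroblast"]
edges = build_intercellular_edges(ppi_edges, ligand_receptor_pairs, cell_types)
```

In a real run the ligand-receptor edges would be restricted, for instance to ligands detected in the fibroblast secretome, rather than connecting every cell type to every other.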

Once I have incorporated the dbPTM database and the Li ligand-receptor database, I will try the OmicsIntegrator method without the plate model on the fibroblast secretome data alone.

I attended as a guest several sessions of the Pathology of Inflammation class taught by Traci Wilgus. One key thing I learned is that during wound healing, neutrophils that traffic to the wound don’t necessarily all apoptose there, as had previously been thought. In zebrafish, they have been visualized trafficking back and forth repeatedly, to the wound and back out to the blood vessel. Neutrophils have also been imaged tethering one to another, i.e., a neutrophil grabs onto another neutrophil. The class described the several steps required for inflammatory cells to traffic from a blood vessel to the wounded tissue: rolling, with weak binding to the vessel wall via selectins; activation, with stronger binding to the vessel wall via integrins; firm adhesion and crawling along the vessel wall; and transmigration through the tissue along a chemokine gradient. These steps require multiple specializations.

I formulated a hypothesis, which Wilgus thought novel, that the reverse trafficking of neutrophils back out from the wound to the blood vessel may serve to fetch stem cells from elsewhere. In this way the stem cells would not need all the specializations that neutrophils have in order to extravasate; the neutrophils could carry them. If this were correct, and if macrophages do the same thing, then the Tumor Microenvironment of Metastasis (TMEM) might be a co-option of a similar process in macrophages. In this case, understanding the gradients that lead the inflammatory cells back to the wound carrying the stem cells could help illuminate why metastasis occurs to particular sites, and what local and systemic factors might affect when it occurs.

I presented a poster at the T2C Wound Healing conference at OSU, constituting a literature review of wound healing processes that might be related to metastatic and post-metastatic cancer.  I found one novel connection.

I learned from Lisa Christian that the Experience Sampling Method is used in many studies to track measurements of cytokine levels in the bloodstream of individual patients, longitudinally or under different conditions, with clinical variables. Data from such studies might be integrated into the model to be developed in Aim 3, for instance to select the most useful molecules or biomarkers to include in order to incorporate systemic factors affecting the tumor microenvironment.