###################################################################################
Tejaswini Mishra, August 23, 2013
###################################################################################
 
1. This file contains expression data for 22977 genes in 15 samples (covering different cell types and treatments).

2. Replicates have been combined to get pooled data.

3. Expression levels are reported as log2 FPKM. The expression threshold is 3: genes with log2 FPKM > 3 should be treated as expressed.
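
A minimal Python sketch of applying this threshold to one sample. The gene names and values are invented for illustration, and the helper names (is_expressed, expressed_genes) are hypothetical, not part of any script used to produce this file.

```python
EXPRESSION_THRESHOLD = 3.0  # log2 FPKM cutoff stated in note 3


def is_expressed(log2_fpkm, threshold=EXPRESSION_THRESHOLD):
    """Return True when a log2 FPKM value is strictly above the cutoff."""
    return log2_fpkm > threshold


def expressed_genes(log2_fpkm_by_gene, threshold=EXPRESSION_THRESHOLD):
    """Return the sorted names of genes above the cutoff in one sample."""
    return sorted(
        gene for gene, value in log2_fpkm_by_gene.items()
        if is_expressed(value, threshold)
    )


# Invented values, for illustration only.
example = {"GeneA": 5.2, "GeneB": 3.0, "GeneC": -1.4}
print(expressed_genes(example))
```

Because the comparison is strict, a gene with log2 FPKM of exactly 3 is not counted as expressed.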

4. All FPKMs are log2-transformed and normalised between cell types using DESeq-like normalisation (scaling by a size factor).
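
For readers unfamiliar with size-factor scaling, here is a rough Python sketch of the median-of-ratios approach used by DESeq. It is an assumption that the normalisation here followed this exact recipe; the note above only says "DESeq-like". The function names and the input layout (a dict of sample name to a list of per-gene values) are hypothetical. The log2 step is left out because the handling of zero values is not stated.

```python
import math
from statistics import median


def size_factors(values_by_sample):
    """Median-of-ratios size factors, one per sample.

    values_by_sample maps sample name -> list of per-gene values,
    with genes in the same order for every sample.
    """
    samples = list(values_by_sample)
    n_genes = len(values_by_sample[samples[0]])

    # Log geometric mean per gene; genes with a non-positive value
    # in any sample are skipped (marked None).
    log_geo_means = []
    for g in range(n_genes):
        vals = [values_by_sample[s][g] for s in samples]
        if all(v > 0 for v in vals):
            log_geo_means.append(sum(math.log(v) for v in vals) / len(vals))
        else:
            log_geo_means.append(None)

    factors = {}
    for s in samples:
        log_ratios = [
            math.log(values_by_sample[s][g]) - lgm
            for g, lgm in enumerate(log_geo_means)
            if lgm is not None
        ]
        # statistics.median raises StatisticsError if no gene is usable.
        factors[s] = math.exp(median(log_ratios))
    return factors


def normalise(values_by_sample):
    """Divide every value by the size factor of its sample."""
    factors = size_factors(values_by_sample)
    return {
        s: [v / factors[s] for v in vals]
        for s, vals in values_by_sample.items()
    }
```

With this scheme, two samples that differ only by a constant scaling become identical after normalisation.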

5. "PoolNormLog2FPKMwithDEnoSilent.txt" has pooled log2 FPKM and differential expression test results for 2 pairwise comparisons, "Meg Vs. Ery" & "G1E Vs. ER4 30hrs".

6. Cuffdiff marks genes that pass an FDR of 0.05 as significant. I parse the fold change column to separate these into upregulated and downregulated genes. Any gene that is significant by Cuffdiff but silent in both of the cell types compared is marked as 'not changing'.

7. Interpretation of differential expression codes: 
	 0 = not changing
	 1 = upregulated (in Cell Type 2 as compared to Cell Type 1)
	-1 = downregulated (in Cell Type 2 as compared to Cell Type 1).
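
The coding rule in notes 6 and 7 can be sketched in Python as below. This is an illustration, not the script used to produce the file. Two assumptions are made: "silent" means log2 FPKM at or below the expression threshold of 3, and a positive log2 fold change means higher expression in Cell Type 2 than in Cell Type 1. The function name de_code is hypothetical.

```python
def de_code(significant, log2_fold_change, log2_fpkm_1, log2_fpkm_2,
            threshold=3.0):
    """Return 1, -1 or 0 for one gene in one pairwise comparison.

    significant      -- True if the gene passed the FDR cutoff
    log2_fold_change -- assumed to be log2(Cell Type 2 / Cell Type 1)
    log2_fpkm_1/2    -- pooled log2 FPKM in Cell Type 1 and Cell Type 2
    """
    if not significant:
        return 0
    # Assumption: "silent" = not above the expression threshold.
    if log2_fpkm_1 <= threshold and log2_fpkm_2 <= threshold:
        return 0
    if log2_fold_change > 0:
        return 1
    if log2_fold_change < 0:
        return -1
    return 0
```

For example, a significant gene with log2 FPKM values of 1.0 and 2.5 would still be coded 0 under this sketch, because it is below the threshold in both cell types.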

8. The "G1E Vs. ER4 30hrs" comparison should be treated as the final set of differentially expressed genes in the G1E-ER4 system.

9. FPKMs for globins are extrapolated from the corresponding read counts. Using data from 3 other highly expressed genes (Fth1, Ftl1 and Rplp1), I compute a ratio of read counts to FPKM and use this ratio to convert globin counts to FPKMs. Globins are marked as "not changing" in the differential expression results.
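
A small Python sketch of this kind of extrapolation. How the three per-gene ratios were combined is not stated above, so averaging them is an assumption, and all numbers below are invented for illustration. The function names are hypothetical.

```python
def counts_per_fpkm(reference_genes):
    """Mean ratio of read count to FPKM over the reference genes.

    reference_genes maps gene name -> (read_count, fpkm).
    Averaging the per-gene ratios is an assumption of this sketch.
    """
    ratios = [count / fpkm for count, fpkm in reference_genes.values()]
    return sum(ratios) / len(ratios)


def extrapolate_fpkm(read_count, ratio):
    """Convert a read count to an FPKM estimate using the ratio."""
    return read_count / ratio


# Invented numbers, for illustration only.
reference = {
    "Fth1": (1000.0, 100.0),
    "Ftl1": (2000.0, 200.0),
    "Rplp1": (3000.0, 300.0),
}
ratio = counts_per_fpkm(reference)
globin_fpkm = extrapolate_fpkm(50000.0, ratio)
print(ratio, globin_fpkm)
```

Note that a single counts-to-FPKM ratio ignores differences in transcript length between the reference genes and the globins, so the result is only an approximation.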

10. Gene coordinates in BED12 format were generated by converting "genes_noRandom_OneGeneModelNew.gtf" with the gtf2bed.pl script from http://ea-utils.googlecode.com/svn/trunk/clipper/gtf2bed

11. "genes_noRandom_OneGeneModelNew.gtf" was obtained by downloading and modifying a UCSC-sourced RefSeq gene model annotation file, "genes.gtf" from the Illumina iGenomes collection. 

12. Steps to obtain "genes.gtf": go to http://support.illumina.com/sequencing/sequencing_software/igenome.ilmn and choose organism "Mus musculus", source UCSC, build mm9, gene model RefSeq. The modifications were as follows:
	
	- removed lines matching chr_xx_random, chr_Un
	- kept only one transcript model per gene, using the "knownCanonical" table of the UCSC Genes track to identify the canonical isoform of each gene. If a gene had no entry in this table, transcript length, number of exons and CDS length were used to retain the longest possible transcript.
	- removed lines matching the pattern "Snord" to remove snoRNAs. However, the final file was found to still contain snoRNAs with names matching the pattern "Snora". These should be removed in future versions.
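
The two line-based filters above (random/Un chromosomes and snoRNA name patterns) could be sketched in Python roughly as follows. This is not the original procedure: the tests on the sequence name ("_random" anywhere, or a "chrUn" prefix) are an assumed reading of "chr_xx_random, chr_Un", and the function name keep_gtf_line is hypothetical. The canonical-isoform selection is not covered here.

```python
def keep_gtf_line(line, name_patterns=("Snord",)):
    """Return True if a GTF line should be kept.

    Drops lines on random or unplaced chromosomes and lines
    containing any of the given name patterns.
    """
    if not line.strip() or line.startswith("#"):
        return True  # leave blank lines and comments untouched
    seqname = line.split("\t", 1)[0]
    # Assumed interpretation of "chr_xx_random, chr_Un".
    if "_random" in seqname or seqname.startswith("chrUn"):
        return False
    return not any(pattern in line for pattern in name_patterns)


def filter_gtf(lines, name_patterns=("Snord",)):
    """Return only the lines that pass keep_gtf_line."""
    return [ln for ln in lines if keep_gtf_line(ln, name_patterns)]
```

Calling filter_gtf(lines, name_patterns=("Snord", "Snora")) would also drop the "Snora" entries mentioned above.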

###################################################################################
###################################################################################