GPS

Citation: Discovering homotypic binding events at high spatial resolution. Yuchun Guo, Georgios Papachristoudis, Robert C Altshuler, Georg K Gerber, Tommi S Jaakkola, David K Gifford & Shaun Mahony, Bioinformatics. 2010 Dec 15;26(24):3028-34. Epub 2010 Oct 21. PMID: 20966006.

GPS has been superseded by our new method, Genome wide Event finding and Motif discovery (GEM). Further improvements to GPS will be incorporated into GEM. Please contact us if you need to download previously released GPS software.

The Genome Positioning System (GPS) is a software tool to study protein-DNA interaction using ChIP-Seq data. GPS builds a probabilistic mixture model to predict the most likely positions of binding events at single-base resolution. GPS has 3 main features:

Download

Please download our new method, Genome wide Event finding and Motif discovery (GEM). GPS is now a part of GEM.

System requirements

Java 1.6 is required to execute the JAR. For a ChIP-Seq experiment with 4 million IP reads and 8 million control reads, GPS requires about 750 MB of memory and runs for about 20 minutes on a single-CPU AMD 64-bit 2.3 GHz computer (about 6 minutes with multiple threads on an 8-CPU computer).
For some machines, the default maximum memory heap size for the Java
virtual machine may not be large enough. It can be specified at the
command line with the option

Read distributions

A read distribution file is required for GPS. The user can use the default read distribution file provided with the GPS software as a starting point. After one round of prediction, GPS will re-estimate the read distribution using the predicted events.

-344 1.42285E-4

Alternatively, it can be estimated directly from the ChIP-Seq data. Given a set of events, we count all the reads at each position (the 5' end of the reads) relative to the corresponding event positions. The initial set of events for estimating the empirical spatial distribution can be defined by using known motifs or by finding the center of the forward and reverse read profiles (if available). GPS has a tool to calculate the read distribution from a user-provided file (coords.txt) containing the coordinates:
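
The GPS tool itself is not reproduced here. Independently of it, the counting procedure described above can be illustrated with a minimal Python sketch. This is not the GPS implementation: the function name, the input tuples, the window size, and the handling of reverse-strand reads (their offsets are mirrored so that both strands share one profile) are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right
from collections import Counter, defaultdict


def empirical_read_distribution(events, reads, window=250):
    """Estimate a read distribution around a set of events.

    events: iterable of (chrom, position) event coordinates.
    reads:  iterable of (chrom, five_prime_position, strand), strand "+" or "-".
    Returns {offset: probability} for offsets in [-window, window].
    Hypothetical helper, not part of GPS.
    """
    # Group the 5' ends by chromosome and sort them for fast range lookup.
    by_chrom = defaultdict(list)
    for chrom, pos, strand in reads:
        by_chrom[chrom].append((pos, strand))
    for chrom_reads in by_chrom.values():
        chrom_reads.sort()

    counts = Counter()
    for chrom, center in events:
        chrom_reads = by_chrom.get(chrom, [])
        # "" sorts before and "~" sorts after both strand symbols,
        # so reads exactly at the window edges are included.
        lo = bisect_left(chrom_reads, (center - window, ""))
        hi = bisect_right(chrom_reads, (center + window, "~"))
        for pos, strand in chrom_reads[lo:hi]:
            # Assumption: mirror reverse-strand offsets around the event.
            offset = pos - center if strand == "+" else center - pos
            counts[offset] += 1

    total = sum(counts.values())
    if total == 0:
        return {}
    # Normalize the counts so the profile sums to 1.
    return {off: counts[off] / total for off in range(-window, window + 1)}
```

With noisy data or few events the resulting profile will be rough, so in practice some smoothing would probably be applied before using it; that step is left out of this sketch.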
After GPS makes the prediction, it will re-estimate the read distribution using the predicted events. A command line option

If the data are too noisy or too few events are used for re-estimation, the new read distribution may not be accurate. Users are encouraged to examine the read distributions using the plot (X_All_Read_Distributions.png) output by GPS.

Input and output

GPS takes an alignment file of ChIP-Seq reads as input and reports a list of predicted binding events. ChIP-Seq alignment file formats that are supported:

GPS outputs a tab-delimited file (xxx_n_GPS_significant.txt) with the following fields:
Because of the read distribution re-estimation, GPS may output event prediction and read distribution files for multiple rounds. (See the Q&A for more details.) Optionally, GPS can be set to output BED files (using option

Examples:

This data set can be used to test GPS. It comes from an Ng lab publication (PMID: 18555785) and consists of Bowtie alignments of mouse ES cell CTCF ChIP-seq and GFP control reads.

Once everything is unpacked into the same directory as the gps.jar file, use the following command:

An example of GPS run in multi-condition alignment mode is (please note: the multi-condition mode may take longer to run)

Command-line options:

The
command line parameters are in the format of
Some parameters are optional:
Optional flags:

Q&A

Which round of results should I use?

Because of the read distribution re-estimation, GPS may output event prediction and read distribution files for multiple rounds. The round numbers are coded in the file name. For example,

Multi-condition vs. multi-replicates

GPS can analyze binding data from multiple conditions (time points) simultaneously. The user needs to give them different names, for example, --exptCond1 CTCF_cond1.bed --exptCond2 CTCF_cond2.bed.

For multiple replicates of the same condition, you can specify the replicates as separate files, for example, --exptCond1 CTCF_cond1_rep1.bed --exptCond1 CTCF_cond1_rep2.bed (note that they need to have the same name). GPS will combine the replicates into one large dataset for analysis.

Read filtering and event filtering

PCR amplification artifacts typically manifest as many reads mapping to exactly the same base positions. These artifacts are quite variable and dataset-specific. Therefore, a generic approach to exclude those regions might result in the loss of true events. GPS implements an event filtering method that compares the read distribution of the predicted event to the expected event read distribution. A shape deviation score (IPvsEMP field) is computed using Kullback–Leibler divergence (see method section 2.6 of the GPS paper). A higher score means the event is more divergent from the expected read distribution, and hence more likely to be an artifact or noise. A cutoff score can be specified by the user to filter out spurious events using the option --sd.
GPS also excludes events with less than 3-fold enrichment (IP/Control). GPS reports the filtered events, which allows the user to verify and adjust the cutoff threshold for a particular dataset. The shape deviation filter is on by default, but can be turned off using the option --nf.
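
To make the two event-level filters concrete, the following Python sketch computes a Kullback–Leibler divergence between an observed read profile and an expected one, then applies a deviation cutoff together with the 3-fold enrichment rule. It is only an illustration under stated assumptions, not the GPS code: the exact formulation in section 2.6 of the paper may differ, and the direction of the divergence (observed relative to expected), the pseudocount smoothing, the floor on control reads, and all function and parameter names are hypothetical.

```python
import math


def shape_deviation(observed_counts, expected_probs, pseudocount=1e-6):
    """KL divergence (in bits) of an observed read profile from the expected one.

    Both inputs list values over the same positions around the event.
    Assumption: divergence is taken as KL(observed || expected).
    """
    if len(observed_counts) != len(expected_probs):
        raise ValueError("profiles must cover the same positions")

    def normalize(values):
        total = float(sum(values))
        if total <= 0:
            raise ValueError("profile has no mass")
        n = len(values)
        # Smooth with a small pseudocount so empty positions do not give log(0).
        return [(v / total + pseudocount) / (1.0 + n * pseudocount) for v in values]

    p = normalize(observed_counts)
    q = normalize(expected_probs)
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))


def keep_event(ip_reads, control_reads, deviation, max_deviation, min_fold=3.0):
    """Keep an event only if it passes both filters.

    Assumptions: control reads are already scaled to the IP library, and a
    floor of 1 read avoids division by zero.
    """
    fold = ip_reads / max(control_reads, 1.0)
    return fold >= min_fold and deviation <= max_deviation
```

An event whose reads pile up at a single base, as PCR artifacts do, gets a much higher score than one whose profile follows the expected distribution, which is the behaviour the cutoff relies on.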

In addition, GPS applies a Poisson filter for abnormally high read counts at a base position. For each base, we obtain an average read count by estimating a Gaussian kernel density (with std = 20 bp) on the read counts of nearby base positions (excluding the base of interest). The estimated value is used to set the Lambda parameter of a Poisson distribution. The actual read count is then set to the value corresponding to p-value = 0.001 if it is larger.

Contact

Contact Yuchun Guo (yguo at mit dot edu) or Shaun Mahony (mahony at mit dot edu) with any problems, comments, or suggestions. Sign up for the GPS mailing list to receive emails related to GPS updates, releases, etc.