Regulatory Module Discovery (RMD)
Citation:
Modular combinatorial binding among human trans-acting factors reveals direct and indirect factor binding. Yuchun Guo & David K Gifford, (2017) BMC Genomics.
The software and data can be downloaded from the following links.
1. Unzip the k562.tar.gz file, download the gem.jar file, and run the following command.
java -Xmx15G -jar gem.jar RMD --g hg19.chrom.sizes --ex hg19_encode_blacklist.txt --tf_peak_file k562.tf_peak_file.txt --distance 50 --min_site 3 --out k562
--distance specifies the distance for merging nearby TF binding sites into co-binding regions.
--min_site specifies the minimum number of TF sites in a region.
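To make the effect of --distance and --min_site concrete, here is a minimal Python sketch of one plausible merging rule. It is not RMD's actual implementation: it assumes each site is a single coordinate, that a site joins the current region when it lies within `distance` bp of the previous site, and that regions with fewer than `min_site` sites are discarded. The function name `merge_sites` and the tuple layout are hypothetical.

```python
from collections import defaultdict


def merge_sites(sites, distance=50, min_site=3):
    """Group (chrom, position, tf_name) sites into co-binding regions.

    Illustrative assumption: consecutive sites on a chromosome are merged
    when the gap between them is at most `distance`; regions with fewer
    than `min_site` sites are dropped.
    """
    by_chrom = defaultdict(list)
    for chrom, pos, tf in sites:
        by_chrom[chrom].append((pos, tf))

    def to_region(chrom, members):
        # Region spans from the first to the last site position.
        return (chrom, members[0][0], members[-1][0], [tf for _, tf in members])

    regions = []
    for chrom in sorted(by_chrom):
        current = []
        for pos, tf in sorted(by_chrom[chrom]):
            # Close the current region when the gap exceeds `distance`.
            if current and pos - current[-1][0] > distance:
                if len(current) >= min_site:
                    regions.append(to_region(chrom, current))
                current = []
            current.append((pos, tf))
        if len(current) >= min_site:
            regions.append(to_region(chrom, current))
    return regions
```

With the default values, three sites at chr1:100, chr1:130 and chr1:170 form one region, while two further sites at chr1:400 and chr1:420 are dropped because they do not reach the minimum of 3 sites.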
--tf_peak_file takes a text file of the format "TF_Name TAB TF_GEM_event_file_path", as shown in the example k562.tf_peak_file.txt in k562.tar.gz.
If you are not using GEM binding event files as input, you can use BED-format peak call files instead by adding the option --format bed. In that case, the --tf_peak_file parameter takes a text file of the format "TF_Name TAB TF_peak_BED_file_path".
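If you have many TFs, the two-column file for --tf_peak_file can be generated rather than typed by hand. The sketch below only assumes the "TF_Name TAB path" layout described above; the function name `format_tf_peak_lines` and the example paths are hypothetical.

```python
def format_tf_peak_lines(tf_to_path):
    """Return the text of a tf_peak_file: one "TF_Name<TAB>path" line per TF."""
    lines = []
    for tf_name, peak_path in sorted(tf_to_path.items()):
        # A tab inside a name or path would break the two-column format.
        if "\t" in tf_name or "\t" in str(peak_path):
            raise ValueError(f"tab character not allowed: {tf_name!r}")
        lines.append(f"{tf_name}\t{peak_path}\n")
    return "".join(lines)


if __name__ == "__main__":
    # Hypothetical TF names and file paths, for illustration only.
    example = {
        "CTCF": "peaks/CTCF.bed",
        "MAX": "peaks/MAX.bed",
    }
    with open("my.tf_peak_file.txt", "w") as handle:
        handle.write(format_tf_peak_lines(example))
```

The resulting file can then be passed to --tf_peak_file, together with --format bed if the paths point to BED files.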
2. With the output file (0_BS_clusters.k562.d50.min3.HDP.txt), run the HDP software from the command line.
your_hdp_path/hdp --algorithm train --data 0_BS_clusters.k562.d50.min3.HDP.txt --eta 0.1 --directory out_folder_name --random_seed 0 --init_topics 50 --max_iter 2000
You may want to explore the --eta and --init_topics parameters depending on the number of TFs and number of sites per region.
Check the state.log file from HDP for convergence (the likelihood values should approach a maximum value, with some small fluctuation). Adjust --max_iter or other parameters if HDP did not converge. You may also try multiple runs with different --random_seed values and select the one that gives the best converged likelihood value.
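The convergence check can be done by eye, but a rough automated check is sketched below. It rests on assumptions that you should verify against your own state.log: that the file is a whitespace-delimited table whose first line is a header, and that one column is named "likelihood". The window size and tolerance are arbitrary illustrative choices, not values recommended by HDP or RMD.

```python
def read_likelihoods(lines, column="likelihood"):
    """Extract one numeric column from a whitespace-delimited table.

    Assumes the first non-empty line is a header containing `column`.
    """
    rows = [line.split() for line in lines if line.strip()]
    header, body = rows[0], rows[1:]
    idx = header.index(column)
    return [float(row[idx]) for row in body]


def looks_converged(values, window=50, rel_tol=1e-3):
    """Compare the mean of the last `window` values with the window before it."""
    if len(values) < 2 * window:
        return False
    recent = sum(values[-window:]) / window
    previous = sum(values[-2 * window:-window]) / window
    return abs(recent - previous) <= rel_tol * abs(previous)
```

For example, `looks_converged(read_likelihoods(open("out_folder_name/state.log")))` gives a quick yes/no answer per run, and the last likelihood values of several runs with different --random_seed values can be compared in the same way.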
3. Post-process the HDP output files (the file names and numeric values in the files are all zero-based) to cluster or visualize the modules.
- 01900-topics.dat: the module/topic matrix at iteration 2000. Rows: modules (topics); columns: TFs. The TF labels can be found in the 0_BS_clusters.K562.Dictioinary.txt file.
- 01900-word-assignments.dat: the assignment of TF binding sites of each region to a module. The IDs are all 0-based.
  - Column 1, d: Region (document) ID;
  - Column 2, w: TF (word) ID;
  - Column 3, z: Module (topic) ID;
  - Column 4, t: not used for RMD.
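
As a starting point for post-processing, the sketch below tallies how often each TF is assigned to each module from the word-assignments file. It assumes the file is whitespace-delimited with the four columns d, w, z, t described above, and it skips any line whose first field is not an integer (for example a header line); check this against your own output. The function name `count_tf_per_module` is hypothetical.

```python
from collections import Counter, defaultdict


def count_tf_per_module(lines):
    """Return {module_id: Counter({tf_id: number_of_assigned_sites})}."""
    counts = defaultdict(Counter)
    for line in lines:
        fields = line.split()
        # Skip blank lines and a possible header such as "d w z t".
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        # Columns: region (d), TF (w), module (z); column t is ignored.
        _region_id, tf_id, module_id = (int(x) for x in fields[:3])
        counts[module_id][tf_id] += 1
    return counts
```

Passing an open file handle for 01900-word-assignments.dat to `count_tf_per_module` yields per-module TF counts keyed by the 0-based IDs; `counts[module_id].most_common(10)` then lists the ten most frequent TF IDs in that module, which can be mapped to TF names using the dictionary file.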