Regulatory Module Discovery (RMD)

Citation:
Modular combinatorial binding among human trans-acting factors reveals direct and indirect factor binding.
Yuchun Guo & David K Gifford (2017). BMC Genomics.

The software and data can be downloaded from the following links.

1. Extract the k562.tar.gz archive, download gem.jar, and run the following command.

java -Xmx15G -jar gem.jar RMD --g hg19.chrom.sizes --ex hg19_encode_blacklist.txt --tf_peak_file k562.tf_peak_file.txt --distance 50 --min_site 3 --out k562

  • --distance specifies the distance for merging nearby TF binding sites into co-binding regions.
  • --min_site specifies the minimum number of TF sites in a region.
  • --tf_peak_file takes a text file of the format "TF_Name TAB TF_GEM_event_file_path", as shown in the example k562.tf_peak_file.txt in k562.tar.gz.
If your input is not GEM binding event files, you can use BED-format peak call files instead by adding the option --format bed. In that case, the --tf_peak_file parameter takes a text file of the format "TF_Name TAB TF_peak_BED_file_path".
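The two-column layout described above can be generated with a short script when there are many factors. The following is a minimal Python sketch, not part of GEM; the function name write_tf_peak_file and the factor names and paths in the example are hypothetical placeholders.

```python
from pathlib import Path


def write_tf_peak_file(tf_to_path, out_path):
    """Write one 'TF_Name<TAB>file_path' line per factor and return the lines."""
    lines = []
    for tf_name, peak_path in sorted(tf_to_path.items()):
        # A tab inside a name or path would break the two-column layout.
        if "\t" in tf_name or "\t" in str(peak_path):
            raise ValueError(f"tab character not allowed: {tf_name!r}")
        lines.append(f"{tf_name}\t{peak_path}")
    Path(out_path).write_text("\n".join(lines) + "\n")
    return lines


# Hypothetical factor names and file paths, for illustration only.
example = {
    "CTCF": "peaks/CTCF.bed",
    "MAX": "peaks/MAX.bed",
}
# write_tf_peak_file(example, "my.tf_peak_file.txt")
```

The resulting file can then be passed to --tf_peak_file (together with --format bed if the listed files are BED peak calls).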

2. With the output file (0_BS_clusters.k562.d50.min3.HDP.txt), run the HDP software from the command line.

your_hdp_path/hdp --algorithm train --data 0_BS_clusters.k562.d50.min3.HDP.txt --eta 0.1 --directory out_folder_name --random_seed 0 --init_topics 50 --max_iter 2000

You may want to explore the --eta and --init_topics parameters depending on the number of TFs and number of sites per region.

Check the state.log file from HDP for convergence (the likelihood values should approach a maximum value, with some small fluctuation). Adjust --max_iter or other parameters if HDP did not converge. You may also try multiple runs with different --random_seed values and select the run that gives the best converged likelihood value.
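The convergence check can be roughed out in a few lines of Python. This is only a sketch under assumptions: it assumes state.log is a whitespace-delimited table with a header row that contains a column named likelihood (verify this against your own file and change column_name if needed), and the window and rel_tol thresholds are arbitrary illustrative choices, not values recommended by HDP or RMD.

```python
def read_column(path, column_name="likelihood"):
    """Read one numeric column from a whitespace-delimited log with a header row.

    The header layout and the column name are assumptions; adjust to your file.
    """
    with open(path) as handle:
        header = handle.readline().split()
        index = header.index(column_name)
        return [float(line.split()[index]) for line in handle if line.strip()]


def has_plateaued(values, window=100, rel_tol=1e-3):
    """Return True if the mean of the last `window` values is within
    rel_tol (relative) of the mean of the preceding `window` values."""
    if len(values) < 2 * window:
        return False  # not enough iterations to compare two windows
    recent = sum(values[-window:]) / window
    previous = sum(values[-2 * window:-window]) / window
    return abs(recent - previous) <= rel_tol * abs(previous)


# Example usage (path is a placeholder):
# likelihoods = read_column("out_folder_name/state.log")
# print(has_plateaued(likelihoods), max(likelihoods))
```

Comparing window means rather than single values tolerates the small fluctuations mentioned above. To compare runs with different --random_seed values, the same read_column call can be applied to each run's log and the final plateau values compared.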

3. Post-process the HDP output files (the file names and numeric values in the files are all zero-based) to cluster or visualize the modules.

  • 01900-topics.dat: the module/topic matrix at the 2000th iteration. Rows: modules (topics); columns: TFs. The TF labels can be found in the 0_BS_clusters.K562.Dictioinary.txt file.
  • 01900-word-assignments.dat: the assignment of TF binding sites of each region to a module. The IDs are all 0-based.
    1. Column 1, d: Region (document) ID;
    2. Column 2, w: TF (word) ID;
    3. Column 3, z: Module (topic) ID;
    4. Column 4, t: not used for RMD.
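As a starting point for post-processing, the word-assignments columns listed above can be tallied per module and per region. The following Python sketch is not part of RMD or HDP; the function names are hypothetical, and it assumes whitespace-separated rows, skipping any line whose first field is not an integer (such as a header line, if one is present).

```python
from collections import Counter, defaultdict


def parse_assignments(lines):
    """Yield (region_id, tf_id, module_id) from rows of d, w, z, t.

    Lines whose first field is not a non-negative integer are skipped.
    The fourth column (t) is ignored, since it is not used for RMD.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        region_id, tf_id, module_id = (int(x) for x in fields[:3])
        yield region_id, tf_id, module_id


def tf_counts_per_module(assignments):
    """Count how many binding sites of each TF are assigned to each module."""
    counts = defaultdict(Counter)
    for _, tf_id, module_id in assignments:
        counts[module_id][tf_id] += 1
    return counts


def dominant_module_per_region(assignments):
    """Map each region ID to the module that holds most of its sites."""
    per_region = defaultdict(Counter)
    for region_id, _, module_id in assignments:
        per_region[region_id][module_id] += 1
    return {r: c.most_common(1)[0][0] for r, c in per_region.items()}


# Example usage (path is a placeholder):
# with open("out_folder_name/01900-word-assignments.dat") as handle:
#     rows = list(parse_assignments(handle))
# print(dominant_module_per_region(rows))
```

All IDs returned are the 0-based IDs from the file; TF IDs can be translated to names using the dictionary file mentioned above.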