A novel k-mer set memory (KSM) motif representation improves regulatory variant prediction
Genome Research, 2018. 28: 891-900, Yuchun Guo, Kevin Tian, Haoyang Zeng, Xiaoyun Guo, and David K. Gifford.
An earlier version of the paper has also been deposited to bioRxiv.

KSM and PWM motifs from ENCODE phase III TF ChIP-seq data

The Java software and test data can be downloaded from the following links.

Examples on running the KMAC and KSM code

Download gem.jar and other files. Run the following on the command line. Note: Specify KMAC or KSM as the first parameter.

KMAC motif discovery (Example output)

java -Xmx8G -jar gem.jar KMAC --pos_seq ES_Oct4_61bp.fasta --k_win 61 --k_min 5 --k_max 13 --k_top 10 --out_name Oct4

KSM motif scanning (Example output)

java -Xmx8G -jar gem.jar KSM --fasta ES_Oct4_61bp.fasta --ksm ksm_list.txt --out Oct4.KSM.scan

Note: The KSM motif match SeqPos position shows the expected binding position of the k-mer set. Because the k-mers are consistently aligned in a KSM, you can figure out what is the binding position in the k-mer sequences. For example, in the provided Oct4.KSM.txt, the offset of first k-mer ATGCAAA is 1, which means the starting base A of ATGC is at position 1 relative to the binding position (at 0). From the fouth k-mer TATGCANA (offset 0), you can also figure out the binding position is T, one base before ATGC. With this in mind, you can check the motif match position in the query sequence. In addition, for motif instances on the minus strand, the SeqPos is the position on the reverse compliment of the input sequence.

Add a --matrix option to generate a KSM motif feature matrix (sequence x motif matrix, each element is the maximum KSM motif score in the sequence).

java -Xmx8G -jar gem.jar KSM --fasta ES_Oct4_61bp.fasta --ksm ksm_list.txt --out Oct4.KSM.scan --matrix

Add a --simple option to generate a KSM motif instances file with fewer columns. This is useful to reduce file size when KSM motif scanning is run on whole genome/chromosome.

java -Xmx8G -jar gem.jar KSM --fasta ES_Oct4_61bp.fasta --ksm ksm_list.txt --out Oct4.KSM.scan --simple

KSM file format

1. In the k-mer sequence, N stands for a gap in the gapped k-mers.
2. The $$$ sign is to signal that the k-mers below are the base exact k-mers for the gapped k-mers. For example, a gapped k-mer ACCNT consists of base k-mers ACCAT, ACCCT, ACCGT, and ACCTT.

Incorporating sequence weights

For datasets that have a weight associated with each sequence, such as the read count of a ChIP-seq binding event, KMAC by default weights the positive sequences with a factor of the natural logarithm of the input sequence weight and then normalizes the weights such that the average is equal to one. To obtain the sequence hit count for k-mers and k-mer groups, the total weights of the sequence hits are summed and rounded. Other weighting schemes such as identity, square-root, or no-weighting can be specified by the users using the --swt [n] option.

The input sequence weights should be provided in the fasta header lines. The format is:

For example,
>seq_1 691.0 ES_Oct4 3:121848465-121848525
>seq_2 285.0 ES_Oct4 14:86796013-86796073

Command line options

Options Detailed information
--pos_seq [string] the file path to the positive fasta sequence file
--k [n] the width of the exact sequence part of the k-mers. The total length of a k-mer is k+gap.
--k_min [n] --k_max [n] the minimum and maximum values of k
--seed [k-mer] the seed k-mer to jump start k-mer set discovery. Exact k-mer sequence only. The width of the seed k-mer will be used to set k
--k_seqs [n] the number of top sequences for motif discovery (default=5000)
--k_win [n] the window size of sequences for motif discovery (default=61)
--k_top [n] the max number of k-mers (i.e. k-mer cluster centers) to seed KMAC for each k value (default=5)
--gap [n] the maximun number of gaps, KMAC uses from 0 to max gaps (default=4)
--kmer_hgp [n] the p-value threshold (log10) of hyper-geometric test for enriched kmers (default=-5.0)
--swt [n] the sequence weight type: 0: no weighting, all weights=1.0; 1: as in the input flie; 2: square root; 3: natural logarithm (default=3)
--gc [n] the gc content, if gc=-1, the value will be estimated from the input positive sequences (default=0.41 for human, use 0.42 for mouse)
--out_name [name] the output folder and file name prefix
--neg_seq [string] the file path to the negative fasta sequence file (optional)
--print_aligned_seqs Optional flag, when ON, print out the sequences aligned by the KSMs. Default is OFF.


About AUC
The AUC KMAC uses is not the standard AUROC, but a partial AUROC (fpr<=0.1), i.e. the left 0.1 portion of the full ROC. Therefore the pAUC score is from 0 to 0.1 (see the KSM paper). For better readability, the auc values that KMAC reports, including those in the HTML results, are scaled 1000x. That means auc=35.7 is equivalent to pAUROC=0.0357, meaning it is at 35.7% of the maximum value.

Post your questions, problems, or suggestions on our GEM3 GitHub page by creating a "New Issue".