KSM and KMAC
KSM and PWM motifs from ENCODE phase III TF ChIP-seq data
The Java software and test data can be downloaded from the following links.
Examples of running the KMAC and KSM code
Download gem.jar and other files. Run the following on the command line. Note: Specify KMAC or KSM as the first parameter.
KMAC motif discovery (Example output)
KSM motif scanning (Example output)
Note: The SeqPos of a KSM motif match is the expected binding position of the k-mer set. Because the k-mers in a KSM are consistently aligned, you can work out where the binding position falls within each k-mer sequence. For example, in the provided Oct4.KSM.txt, the offset of the first k-mer ATGCAAA is 1, which means that its starting base, the A of ATGC, is at position 1 relative to the binding position (position 0). The fourth k-mer TATGCANA (offset 0) likewise shows that the binding position is the T one base before ATGC. With this in mind, you can check the motif match position in the query sequence. In addition, for motif instances on the minus strand, the SeqPos is the position on the reverse complement of the input sequence.
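The offset arithmetic in the note above can be sketched in a few lines of Python. This is an illustration only, not part of gem.jar: the function names and the query sequence are hypothetical, and 0-based sequence coordinates are assumed here.

```python
def kmer_matches(kmer, seq, start):
    """Return True if kmer matches seq at index start; N matches any base."""
    window = seq[start:start + len(kmer)]
    return len(window) == len(kmer) and all(
        k == "N" or k == s for k, s in zip(kmer, window)
    )

def binding_position(match_start, kmer_offset):
    """Binding position implied by a k-mer match.

    The offset is the position of the k-mer's first base relative to the
    binding position (0), so the binding position is match_start - offset.
    """
    return match_start - kmer_offset

def reverse_complement(seq):
    """Reverse complement of a DNA sequence (uppercase A, C, G, T, N)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

def to_input_coordinate(pos_on_revcomp, seq_length):
    """Map a position on the reverse complement back to the input sequence.

    Assumes 0-based coordinates; this convention is an assumption.
    """
    return seq_length - 1 - pos_on_revcomp

# Hypothetical query sequence containing the Oct4 k-mers from the note.
query = "GGTATGCAAATCC"
start = query.find("ATGCAAA")            # first k-mer, offset 1
print(binding_position(start, 1))        # same position as TATGCANA with offset 0
```

Both k-mers point to the same base in the query, the T before ATGC, which is what consistent alignment within a KSM implies.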
KSM file format
1. In the k-mer sequence, N stands for a gap in the gapped k-mers.
2. The $$$ sign signals that the k-mers below it are the exact base k-mers of the gapped k-mers. For example, a gapped k-mer ACCNT consists of base k-mers ACCAT, ACCCT, ACCGT, and ACCTT.
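The relationship between a gapped k-mer and its base k-mers can be illustrated with a short Python sketch. The function name is hypothetical and the code does not read the KSM file itself; it only enumerates the base k-mers obtained by filling each N with A, C, G, or T.

```python
from itertools import product

def expand_gapped_kmer(kmer):
    """List all base k-mers obtained by replacing each N with A, C, G, or T."""
    gap_count = kmer.count("N")
    expanded = []
    for bases in product("ACGT", repeat=gap_count):
        fill = iter(bases)
        # Substitute the next chosen base at each gap, keep other bases as-is.
        expanded.append("".join(next(fill) if c == "N" else c for c in kmer))
    return expanded

print(expand_gapped_kmer("ACCNT"))
```

A k-mer with g gaps expands to 4^g base k-mers, so ACCNT yields the four base k-mers listed above.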
Incorporating sequence weights
For datasets that have a weight associated with each sequence, such as the read count of a ChIP-seq binding event, KMAC by default weights each positive sequence by the natural logarithm of its input weight and then normalizes the weights so that their average equals one. To obtain the sequence hit count for k-mers and k-mer groups, the weights of the sequence hits are summed and rounded. Other weighting schemes, such as identity, square-root, or no weighting, can be specified by the user using the
The input sequence weights should be provided in the fasta header lines. The format is:
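Separately from the header format, the default weighting scheme described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not KMAC's actual code: the function names are hypothetical, input weights are assumed to be greater than 1 so that their logarithms are positive, and Python's round() may treat exact .5 ties differently from the Java implementation.

```python
import math

def normalized_log_weights(raw_weights):
    """Natural-log transform the weights, then rescale so their mean is 1."""
    logs = [math.log(w) for w in raw_weights]   # assumes every weight > 1
    mean = sum(logs) / len(logs)
    return [x / mean for x in logs]

def weighted_hit_count(weights, hit_flags):
    """Sum the weights of the sequences hit by a k-mer and round the total."""
    total = sum(w for w, hit in zip(weights, hit_flags) if hit)
    return round(total)

# Hypothetical read counts for three positive sequences.
weights = normalized_log_weights([math.e, math.e ** 2, math.e ** 3])
print(weights)
print(weighted_hit_count(weights, [True, False, True]))
```

Because the weights average to one, an unweighted dataset and a weighted one yield hit counts on a comparable scale.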
Command line options
The AUC that KMAC uses is not the standard AUROC but a partial AUROC (FPR <= 0.1), i.e. the area under the leftmost 0.1 portion of the full ROC curve. The pAUC score therefore ranges from 0 to 0.1 (see the KSM paper). For better readability, the auc values that KMAC reports, including those in the HTML results, are scaled by 1000x. For example, auc=35.7 is equivalent to pAUROC=0.0357, which is 35.7% of the maximum value of 0.1.
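The scaling above is simple arithmetic; a small Python sketch makes the conversion explicit. The constant and function names are hypothetical and only restate the numbers given in the text.

```python
MAX_PAUROC = 0.1      # largest possible partial AUROC for FPR <= 0.1
REPORT_SCALE = 1000   # KMAC reports auc values scaled by 1000x

def reported_auc_to_pauroc(reported_auc):
    """Convert a reported auc value back to the unscaled partial AUROC."""
    return reported_auc / REPORT_SCALE

def fraction_of_maximum(reported_auc):
    """Fraction of the maximum attainable partial AUROC (0 to 1)."""
    return reported_auc_to_pauroc(reported_auc) / MAX_PAUROC

print(reported_auc_to_pauroc(35.7))
print(fraction_of_maximum(35.7))
```

On this scale a reported auc of 100 corresponds to a perfect partial AUROC of 0.1.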
Post your questions, problems, or suggestions on our GEM3 GitHub page by creating a "New Issue".