# 6.189 IAP 2007

**Recitation 5** 

**Cell Profiling Tools** 

## **Agenda**

- Cell Simulator Overview
- Dynamic Profiling Using Counters
- Instruction Scheduling

### **Cell Simulator Highlights**

- Full system simulator can help in debugging and performance optimization
  - Uni-Cell and multi-Cell simulation
  - GUI user Interfaces
  - Cycle accurate SPU simulation
  - Facility for tracing and viewing simulation events
- Note: does not accurately model communication cost

#### Run Cell Simulator

- Launch simulator GUI interface
  - % export SYSTEMSIM\_TOP=/opt/ibm/systemsim-cell
    /opt/ibm/systemsim-cell/bin/systemsim -g &
  - Then click "go"

#### **Main GUI Interface**



#### **Simulated Linux Environments**

Simulated Linux shell as if running on Cell hardware

```
root@(none):~ (on haldi)
SPE 0 has LS + offset of f7fa2080
SPE 1 has LS + offset of f7762000
SPE 0 received buffer "Good morning!"
Last SPE has received buffer of Guten Morgen!
New modified buffer is Bon jour!
[root@(none) ~]# ./dist
Generated random points
Calculating
Done
Correct results
[root@(none) ~]# ./dist
Generated random points
Calculating
Done
Correct results
[root@(none) "]# ./dist
Generated random points
Calculating
Done
Correct results
[root@(none) ~]# 🛮
```

#### Simulated and Native Linux Interoperability

- Simulated Linux has its own file system
- Files can be transferred between the native file system and the simulated file system using the callthru utility
- Example: transfer and execute a Cell program
  - % callthru /tmp/hello-world > hello-world
  - % chmod u+x hello-world
  - % ./hello-world

#### Debugging

#### View machine state



### **Profiling**

- Dynamic profiling and statistics
  - Separate stats for PPU and each SPU



### **Code Instrumentation and Profiling**

- Fine-grained measurements during simulation are possible via prof\_\* routines
  - Profiling routines are no-ops on the Cell hardware

```
#include <profile.h>
...
prof_clear();
prof_start();
function_of_interest();
prof_stop();
```

### **Cell Simulator Availability**

- Simulator is not installed on the PS3 hardware
- Contact TAs if you want to run the simulator

Phil Sung, MIT. 11 6.189 IAP 2007 MIT

## **Agenda**

- Cell Simulator Overview
- Dynamic Profiling Using Counters
- Instruction Scheduling

#### **Performance Counters on the SPUs**

- Each SPU has a counter that counts down at a fixed rate (decrementer)
  - Can be used as a clock
  - Suitable for coarse-grained timing (1000s of instructions)

#### Decrementer Example

```
#define DECR_MAX 0xFFFFFFF
#define DECR_COUNT DECR_MAX
// Start counting
spu_writech(SPU_WrDec, DECR_COUNT);
spu_writech(SPU_WrEventMask, MFC_DECREMENTER_EVENT);
start = spu_readch(SPU_RdDec);
  function_of_interest();
// Stop counting, print count
end = spu_readch(SPU_RdDec);
printf("Time elapsed: %d\n", start - end);
spu writech(SPU WrEventMask, 0);
spu_writech(SPU_WrEventAck, MFC_DECREMENTER_EVENT);
```

## **Agenda**

- Cell Simulator Overview
- Dynamic Profiling Using Counters
- Instruction Scheduling

#### **Review: Instruction Scheduling**

Instructions mostly of the form

$$r3 = f(r1,r2)$$

- Assembly file is a human-readable representation of these instructions
- Conceptually, instructions execute in the order in which they appear in assembly



#### **Review: Instruction Scheduling**

- With pipelining, order of instructions is important!
  - Pipeline stalls while waiting for dependencies to complete



#### **Static Profiling**

- Use static profiling to see where stalls happen
- Generate assembly and instruction schedule
  - Manually

```
# generate assembly (xlc -S also works)
% gcc -S filename.c
# generate timing information
```

- % /opt/ibm/cell-sdk/prototype/bin/spu\_timing
  -running-count ./filename.s
  - Output stored in filename.s.timing
  - -running-count shows cycles elapsed after each instruction
- With our Makefile
  - % SPU\_TIMING=1 make filename.s

### Reading the Assembly

- Instructions of the form
   OP DEST SRC1 SRC2 ...
- Header indicates source files:

```
.file "dist_spu.c"
.file 1 "dist_spu.c"
.file 2 "/opt/ibmcmp/xlc/8.1/include/spu_intrinsics.h"
```

• Markers for source lines:

```
.LS_p1_f1_l19: File 1 (dist_spu.c),
.loc 1 19 0 Line 19

ila $7,a
```

### Interpreting Static Profiler Output



-running-count adds cycle count column

### Instruction Scheduling on Cell

- In-order execution
- Dual pipeline
  - Pipeline selected based on instruction type
  - Two instructions can be issued simultaneously when dependencies allow
- Goal: scheduling instructions to minimize stalls
  - Loads, fp instructions liable to take a long time
  - Dual-issue whenever possible
  - IPC = 2 (instructions per cycles)
  - CPI = .5 (cycles per instruction)

#### **Example Schedule Optimization**

```
(dist_spu.s line 246) .LS_p1_f1_126:
                     .loc 1 26 0
78
                     or $2,$3,$3
                     ila $3,dist
89
 901234
                     lqd $4,80($1)
                     shli $4,$4,8
  ----5678
        678901
                     lqd $5,96($1)
                     shli $5,$5,2
         ----2345
                          $4,$4,$5
               ---67 a
```

#### **Example Schedule Optimization**

```
(dist_spu.s line 246) .LS_p1_f1_126:
                      .loc 1 26 0
                     lqd $4,80($1)
789012
                     lqd $5,96($1)
890123
                     or $2,$3,$3
 90
                      ila $3,dist
  01
   --3456
                     shli $4,$4,8
                     shli $5,$5,2
      4567
                           $4,$4,$5
       ---89
                     a
```

#### 8 cycles saved

## **Exercise 1 (10 minutes)**

- Improve performance by rescheduling instructions
  - wget http://cag.csail.mit.edu/ps3/recitation5/rec5.tar.gz
  - tar zxf rec5.tar.gz
  - cd rec5/lab1/spu
- Examine assembly code
  - export CELL\_TOP=/opt/ibm/cell-sdk/prototype
  - SPU\_TIMING=1 make dist\_spu.s
  - Find an opportunity for performance gain via instruction scheduling and implement it (e.g., reduce stalls after lqd instructions near line 246)
- Generate object file from assembly
  - ./make-obj-file; cd ..; make
  - make-obj-file compiles your modified assembly to binary, otherwise your optimization is lost
- Run and evaluate
  - How many cycles did you save?
    - /opt/ibm/cell-sdk/bin/spu\_timing -running-count dist\_spu.s
  - Is the new code correct?
    - Run and check if correctness test passes

## Instruction Scheduling

- Compilers are very good at doing this automatically
  - Unoptimized code: 469 cycles
  - Optimized code (xlc -05): 188 cycles
- Hand-reordering of optimized assembly is unlikely to produce significant gains except in extreme scenarios

Phil Sung, MIT. 25 6.189 IAP 2007 MIT

### **Notes on Static Profiling**

- Static profiler presents a skewed view of conditionals, loops
  - 8 cycles saved in the static schedule → how many cycles saved when the program runs?
- Data-dependent behavior not captured
  - Static profiler does not factor in loop trip counts or branch frequencies
  - Profiling doesn't account for branch misprediction

#### **Improving Branch Prediction**

- Static branch hinting from source code
  - if(\_\_builtin\_expect(CONDITION, EXPECTED))
  - Useful macros:
    - #define LIKELY(exp) \_\_builtin\_expect(exp, TRUE)
      #define UNLIKELY(exp) \_\_builtin\_expect(exp, FALSE)
    - if(LIKELY(i == j)) { ... }

#### Summary

 Static and dynamic profiling tools are used to identify performance bottlenecks

| Method                                                       | Pros                                                  | Cons                                        |
|--------------------------------------------------------------|-------------------------------------------------------|---------------------------------------------|
| Cell simulator Use to get statistical info on program runs   | Good statistics on stall sources; no recompile needed | Simulator is slow                           |
| Decrementers Use to measure runtime for a segment of code    | Easy to set up                                        | Little insight into sources of stalls       |
| Schedule analysis Use to see instruction- level interactions | Identifies exactly where time is spent                | Low level; only does straight-line analysis |

Phil Sung, MIT. 28 6.189 IAP 2007 MIT