The Vector-Thread Architecture

Ronny Krashinsky, Chris Batten, Krste Asanović
Computer Architecture Group
MIT Laboratory for Computer Science
ronny@mit.edu
www.cag.lcs.mit.edu/scale

Boston Area Architecture Workshop (BARC)
January 30th, 2003

Outline:
  Introduction
  Instruction Parallelism
  Loop Parallelism
  Data Parallelism
  Thread Parallelism
  Vector-Thread Architecture Overview
  Vector Architecture
  Vector Microarchitecture
  Vector-Thread Architecture
  Using VPs for Do-Across Loops
  Vector-Thread Microarchitecture
  Do-Across Execution
  Micro-Threading VPs
  Loop Parallelism and Architectures
  Using the Vector-Thread Architecture
  SCALE-0 Overview

Introduction

Architectures are all about exploiting the parallelism inherent to applications:
  Performance
  Energy
The Vector-Thread Architecture is a new approach which can flexibly take advantage of many forms of parallelism available in different applications: instruction, loop, data, thread.
The key goal of the vector-thread architecture is efficiency: high performance with low power consumption and small area.
A clean, compiler-friendly programming model is key to realizing these goals.

Instruction Parallelism

Independent instructions can execute concurrently.
Super-scalar architectures dynamically schedule instructions in hardware to enable out-of-order and parallel execution.
Software statically schedules parallel instructions on a VLIW machine.

[Figure: Super-scalar (hardware tracks instr. dependencies) and VLIW]

Loop Parallelism

Operations from disjoint iterations of a loop can execute in parallel.
VLIW architectures use software pipelining to statically schedule instructions from different loop iterations to execute concurrently.

[Figure: VLIW software pipeline. The load, add, and store operations of successive iterations (iter. 0, iter. 1, iter. 2, ...) overlap in time; labels: prev, next, loop:]
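
The software-pipelining idea on the Loop Parallelism slide can be illustrated with a small scheduling sketch. It assumes, purely for illustration, that load, add, and store each take one cycle and that a new iteration starts every cycle; the function name `software_pipeline` is hypothetical and does not come from the slides.

```python
def software_pipeline(num_iters, stages=("load", "add", "store")):
    """Return a cycle-by-cycle schedule for a statically pipelined loop.

    Assumption: stage number s of iteration i runs in cycle i + s,
    so each cycle mixes operations from different iterations.
    """
    num_cycles = num_iters + len(stages) - 1
    schedule = []
    for cycle in range(num_cycles):
        ops = []
        for stage_index, stage_name in enumerate(stages):
            iteration = cycle - stage_index
            if 0 <= iteration < num_iters:
                ops.append((stage_name, iteration))
        schedule.append(ops)
    return schedule


schedule = software_pipeline(5)
for cycle, ops in enumerate(schedule):
    print(cycle, ops)
```

In the steady state each cycle holds the load of the newest iteration, the add of the one before it, and the store of the one before that, which is the overlap a VLIW compiler has to arrange statically.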

Data Parallelism

A single operation can be applied in parallel across a set of data.
In vector architectures, one instruction identifies a set of independent operations which can execute in parallel.
Control overhead can be amortized.

[Figure: Vector]

Thread Parallelism

Separate threads of control can execute concurrently.
Multiprocessor architectures allow different threads to execute at the same time on different processors.
Multithreaded architectures execute multiple threads at the same time to better utilize a single set of processing resources.

[Figure: Multiprocessor]
[Figure: SMT]

Vector-Thread Architecture Overview

Data parallelism: start with vector architecture
Thread parallelism: give execution units local control
Loop parallelism: allow fine-grain dataflow communication between execution units
Instruction parallelism: add wide issue

Vector Microarchitecture

Lanes contain regfiles and execution units; VPs map to lanes and share physical resources.
Operations execute in parallel across lanes and sequentially for each VP mapped to a lane; control overhead is amortized to save energy.

[Figure: Execution on Vector Processor. Lane 0 to Lane 3, each showing a sequence of loadA, loadB, add, and store operations]
[Figure: Microarchitecture. Vector-execute commands (load A, load B, add, store) arrive from the control processor at four lanes. Lane 0 holds VP0, VP4, VP8, VP12; Lane 1 holds VP1, VP5, VP9, VP13; Lane 2 holds VP2, VP6, VP10, VP14; Lane 3 holds VP3, VP7, VP11, VP15]
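
The mapping of VPs to lanes in the vector microarchitecture figure can be written down as a short sketch. It assumes the modulo striping suggested by the figure labels (VP0, VP4, VP8, VP12 on Lane 0, and so on); the helper names `lane_of`, `vps_in_lane`, and `vector_execute` are illustrative and not taken from the slides.

```python
def lane_of(vp, num_lanes):
    """Lane that a virtual processor is mapped to (assumed modulo striping)."""
    return vp % num_lanes


def vps_in_lane(lane, num_vps, num_lanes):
    """All virtual processors that share the physical resources of one lane."""
    return list(range(lane, num_vps, num_lanes))


def vector_execute(op, num_vps, num_lanes):
    """Expand one vector operation into time steps.

    Each step lists (lane, vp, op) tuples that run in parallel across lanes;
    successive steps cover the VPs that each lane handles sequentially.
    """
    steps = []
    for first_vp in range(0, num_vps, num_lanes):
        last_vp = min(first_vp + num_lanes, num_vps)
        steps.append([(lane_of(vp, num_lanes), vp, op)
                      for vp in range(first_vp, last_vp)])
    return steps


for op in ("load A", "load B", "add", "store"):
    for step in vector_execute(op, num_vps=16, num_lanes=4):
        print(step)
```

With 16 VPs on 4 lanes, a single vector operation occupies each lane for four steps, which is where the amortization of control overhead comes from.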

Vector-Thread Architecture

Vector of Virtual Processors (similar to traditional vector architecture).
VPs are decoupled: local instruction queues break the rigid synchronization of vector architectures.
Under slave control, the control thread sends instructions to all VPs.
Under micro-threaded control, each VP fetches its own instructions.
Cross-VP communication allows each VP to send data to its successor.

[Figure: Programming Model. Virtual processors VP0, VP1, ..., VP(N-1) under slave control and micro-threaded control, with cross-VP communication]

Using VPs for Do-Across Loops

[Figure: a loop beginning "for (i=0; i..." whose iterations (first shown: i=0) each perform recv, load, add, send]
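
The recv, load, add, send pattern of a do-across loop can be mimicked in software with one thread per VP and a queue on each cross-VP link. This is only a sketch under assumptions: the loop body on the slide is truncated, so the running sum used here is a hypothetical stand-in, and the function name `do_across_running_sum` is invented for illustration.

```python
import queue
import threading


def do_across_running_sum(a, initial=0):
    """Run a loop-carried sum with one 'VP' thread per iteration.

    links[i] carries the value sent from iteration i-1 to iteration i.
    """
    n = len(a)
    links = [queue.Queue(maxsize=1) for _ in range(n + 1)]
    out = [None] * n

    def vp(i):
        x = links[i].get()    # recv: blocks until the predecessor has sent
        v = a[i]              # load
        y = x + v             # add
        links[i + 1].put(y)   # send to the successor VP
        out[i] = y            # store

    threads = [threading.Thread(target=vp, args=(i,)) for i in range(n)]
    # Start order does not matter: each VP waits on its incoming link.
    for t in reversed(threads):
        t.start()
    links[0].put(initial)     # the control thread seeds the first VP
    for t in threads:
        t.join()
    return out, links[n].get()


print(do_across_running_sum([1, 2, 3, 4]))
```

The blocking `get` plays the role of the receiver waiting for its predecessor, so the dependency is resolved at run time without any static schedule.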

Vector-Thread Microarchitecture

VPs are striped across lanes as in a traditional vector machine.
Lanes have a small instruction cache (e.g. 32 instructions) and decoupled execution.
Execute directives point to atomic instruction blocks (AIBs) and indicate which VP(s) the AIB should be executed for; they are generated by a control thread vector-execute command or by a VP fetch instruction.
The do-across network includes dataflow handshake signals: the receiver stalls until data is ready.

[Figure: Microarchitecture. Lane 0 to Lane 3 with VPs striped across them, each lane with an instr. cache and instr. fill, receiving execute directives and connected by the do-across network]
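
As a rough model of execute directives, the sketch below has a control thread turn one vector-execute command into one directive per lane, and has each lane fill its instruction cache once before running the AIB for every VP it was given. All class, field, and function names are hypothetical, the AIB contents are an assumed example, and cache capacity and eviction are not modelled.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ExecuteDirective:
    aib: str      # which atomic instruction block to run
    vps: tuple    # which VPs in this lane should run it


@dataclass
class Lane:
    index: int
    cached_aibs: dict = field(default_factory=dict)
    fills: int = 0                      # number of instruction cache fills
    executed: list = field(default_factory=list)

    def run(self, directive, aib_table):
        # Fetch the AIB once per lane; later VPs reuse the cached copy.
        if directive.aib not in self.cached_aibs:
            self.cached_aibs[directive.aib] = aib_table[directive.aib]
            self.fills += 1
        for vp in directive.vps:
            for instr in self.cached_aibs[directive.aib]:
                self.executed.append((vp, instr))


def vector_execute(aib, num_vps, lanes):
    """Yield one execute directive per lane, covering that lane's VPs."""
    for lane in lanes:
        vps = tuple(range(lane.index, num_vps, len(lanes)))
        yield lane, ExecuteDirective(aib, vps)


aib_table = {"loop_body": ["recv", "load", "add", "send", "store"]}
lanes = [Lane(i) for i in range(4)]
for lane, directive in vector_execute("loop_body", 16, lanes):
    lane.run(directive, aib_table)

for lane in lanes:
    print(lane.index, "fills:", lane.fills, "instructions:", len(lane.executed))
```

Each lane performs a single fill but executes the block four times, which is one way to read the claim that fetch overhead is shared across VPs.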

Do-Across Execution

Dataflow execution resolves do-across dependencies dynamically.
Independent instructions execute in parallel: performance adapts to the software critical path.
Instruction fetch overhead is amortized across loop iterations.

[Figure: after a vector-execute command (load, add, store, recv, send), Lane 0 to Lane 3 execute the recv, load, add, send, store sequence for successive loop iterations]

Micro-Threading VPs

VPs also have the ability to fetch their own instructions, enabling each VP to execute its own thread of control.
The control thread can send a vector fetch instruction to all VPs (i.e. a vector fork), which allows efficient thread startup.
The control thread can stall until the micro-threads "finish" (stop fetching instructions).
Micro-threading enables data-dependent control flow within a loop iteration (an alternative to predication).

[Figure: VP0, VP1, ..., VP(N-1), each running its own micro-thread]

Loop Parallelism and Architectures

Loops are ubiquitous and contain ample parallelism across iterations.
Super-scalar: must track dependencies between all instructions in a loop body (and correctly predict branches) before executing instructions in the subsequent iteration, and must do this repeatedly for each loop iteration.
VLIW: software pipelining exposes parallelism, but requires static scheduling, which is difficult and inadequate with dynamic latencies and dependencies.
Vector: efficient, but limited to do-all loops, no do-across.
Vector-thread: software efficiently exposes parallelism, and dynamic dataflow automatically adapts to the critical path. Uses simple in-order execution units, and amortizes instruction fetch overhead across loop iterations.
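
Returning to the micro-threaded mode described under Micro-Threading VPs, the vector fork and the wait for all micro-threads to finish can be sketched with ordinary threads. The workload (counting Collatz steps) is a hypothetical example of data-dependent control flow chosen for illustration, and `vector_fork` is an invented name, not an instruction from the slides.

```python
import threading


def collatz_steps(n):
    """Data-dependent loop: the trip count depends on the input value."""
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


def vector_fork(thread_body, inputs):
    """Start one micro-thread per VP, then stall until all have finished."""
    results = [None] * len(inputs)

    def micro_thread(vp):
        results[vp] = thread_body(inputs[vp])

    threads = [threading.Thread(target=micro_thread, args=(vp,))
               for vp in range(len(inputs))]
    for t in threads:     # vector fork: every VP starts at the same entry point
        t.start()
    for t in threads:     # control thread waits for the micro-threads to stop
        t.join()
    return results


print(vector_fork(collatz_steps, [1, 6, 7, 8]))  # prints [0, 8, 16, 3]
```

Each VP follows its own path through the loop, so no predication across VPs is needed even though the iteration counts differ.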

Using the Vector-Thread Architecture

The Vector-Thread Architecture seeks to efficiently exploit the available parallelism in any given application.
Using the same set of resources, it can flexibly transition from pure data parallel operation, to parallel loop execution with do-across dependencies, to fine-grain multi-threading.

[Figure: Multi-paradigm Support. A Control Thread and the Virtual Processors running a DO-ACROSS Loop, Micro-threading, and a DO-ALL Loop]
[Figure: Performance & Energy Efficiency. Vector-Threading in relation to ILP, Threads, Vector, and Loop]

SCALE-0 Overview

[Figure: ctrl proc and Vector Thread Unit; 32KB L1 Configurable I/D Cache; CMMU, Network Interface, Outstanding Trans. Table; datapaths labelled 32b, 128b, 4x128b, 256b; Cluster 0 (Mem) with IALU]
a  Fi  0e0e    B2ChDE F*  1 8c8c     ?1 d0u0@Ty2 NP'p<'pA)BCD|E||hh2Rh@`S"   Gi  ` ?"   TIALU   Hi  ` ?" :p-  W Cluster 1     " Ii # lG1?"D B Ji 3 rD1?"  B Ki 3 rD1?"M O  Li  f1?" a  Mi  f1?"a  Ni  0e0e    B2ChDE F*  1 8c8c     ?1 d0u0@Ty2 NP'p<'pA)BCD|E||hh2Rh@`S"  Oi  ` ?"   VFP-ADD   Pi  `\ù ?":pB  W Cluster 2     " Qi # lG1?"WB Ri 3 rD1?"B Si 3 rD1?"__ Ti  f1?"/a Ui  f1?"a Vi  0e0e    B2ChDE F*  1 8c8c     ?1 d0u0@Ty2 NP'p<'pA)BCD|E||hh2Rh@`S"  Wi  `ȹ ?" VFP-MUL   Xi  `(̹ ?":pU W Cluster 3      Yi  f1?"rk   Zi  f?"r   [i  f?"r  \i  f?"r  ]i  f?"ra|  ^i  f?"r*E  _i  f?"r  `i  f?"r  ai  f?"r  bi  `1?"qk  L s*(- ci# + a  di Z1?"s*~(- ei Z?"s-~(- fi Z?"s,~, gi Z?"su,~, hi Z?"s-,~Q, ii Z?"s+~ , ji Z?"s+~+ ki Z?"sV+~z+ li Z?"s+~3+ mi T1?"t*&-L s*(- ni# B a  oi Z1?"s*~(- pi Z?"s-~(- qi Z?"s,~, ri Z?"su,~, si Z?"s-,~Q, ti Z?"s+~ , ui Z?"s+~+ vi Z?"sV+~z+ wi Z?"s+~3+ xi T1?"t*&-L s*(- yi# Wa zi Z1?"s*~(- {i Z?"s-~(- |i Z?"s,~, }i Z?"su,~, ~i Z?"s-,~Q, i Z?"s+~ , i Z?"s+~+ i Z?"sV+~z+ i Z?"s+~3+ i T1?"t*&- i  `׹pGpG ?"a ] Local Regfile   i  `1?"y| i  `۹ ?"l@ iInter-Cluster Communication  B i  fDjJ?"   i  f߹ ?"O @  UPrev-VP   i  f ?"   UNext-VP  B i # lD)?"p p B i # lD)?"p @p  i  `pGpG ?" U  ] Local Regfile   i  `TpGpG ?"k U  ] Local Regfile   i  `HpGpG ?"U ] Local Regfile   i 3 rԌjԌj jJ?"7@ gClustered Virtual Processor B i  fD1?"P@PB i  fD1?"P@P i  f ?" 3 VTile ( 2 B Mh # lD1?"<@@B lh # lD1?"<PPB h # lD1?"<` ` B i  fD1?"300B h C xD1?" ``B Nh C xD1?" ppB mh C xD1?" B h C xD1?"  B i  `DjJ?"  B i  `DjJ?"  B i  `DjJ?"OQF  0 i   h Z$Ԕ?   F   h  `o? "; ; 7 h Z? " #  SALU   h Z ? " #  SALU   h Z( ? " O# C SALU   h Z? " #  SL/S   h Zx? "  SALU   h Z0? "  SALU   h ZH? " O C SALU   h Z? "   SL/S   h Z? "Z b  SALU   h ZD? "Z b  SALU   h Z"? "Z Ob C SALU   h ZX'? "Z b  SL/S   h Z*? "   SALU   h Z0.? " OC SALU   h Z0? "  SL/S    h  `43? "1Q D g Instruction Issue Unit  B h TD? " B h TD? "o o B h  fD? " B h  fD? "> > B h  fD? " B h  fD? "} } B h  fD? " OB h  fD? "> > OB h  fD? 
" OB h  fD? "} }OB h  fD? " C B h  fD? "> C> B h  fD? " C B h  fD? "}C}B h TD? "  B h TD? " B hB  fD? "  B h  fD? "  B hB  fD? "o B h  fD? "o B h  fD? "# o B h TD? "r Mr B hB  fD? " Z B h  fD? " Z B h  fD? "  B h  fD? "  B h TD? "t  B hB  fD? "[  [ B h  fD? "   B hB  fD? "o [ [ B h  fD? "o   B h  fD? "#  o  B hB  fD? " [ Z [ B h  fD? "  Z  B h  fD? "    B iB  fD? " [ [ B i  fD? "   B i  fD? "b   B i  fD? "  B i  `D? " M B i TD? "  B i TD? "M M B iB  fD? " B i  fD? "r rB  iB  fD? "o  B  i  fD? "o r rB  i  fD? "# ro rB  iB  fD? " Z B  i  fD? " rZ rB i  fD? " r rB iB  fD? "  B i  fD? " r rB i  fD? "b r rB i  fD? "rrB i  `D? "rMrB i TD? "rB i TD? "MMrB iB  fD? " B i  fD? " B iB  fD? "o  B i  fD? "o  B i  fD? "# o B iB  fD? " Z B i  fD? " Z B i  fD? "  B iB  fD? "  B i  fD? "  B  i  fD? "b  B !i  fD? "B "i  `D? "MB #i TD? "{B $i TD? "M{MB %i ZDo? "D M B 'i ZDo? "D M  1i NTE ? " [0 ^to tile memory  B 2i TD? " M B 3i TD? "MB 4i TD? "{M{ 5i NI ? "I j  TLane   i ZLM? "   SALU  B hB  fD? " B h  `D? " M B h  fD? " B h  fD? "b B h TD? "Ms M B &i ZDo? "Dg Mg T \ ` i#   pB )i  fD?"\ \ B *i  fD?"  B +i  fD?"  B ,i  fD?"::B -i  fD?"  B .i  fD?"! ! B /i  fD?"  B 0i  fD?"``B (i ZDo? "D|M| i s ԌjԌj    BC@DEF___ ) 8c8c     ?1 d0u0@Ty2 NP'p<'pA)BCD|E||6IC@@  S"`` @ H h 0nv ? affj  P(  r  S VF)   H  0nv ? affrx@;|>4PGl~)  4A#%0"Y+0 11 N? 4 m(ܘN5Oh+'0p , HX |   The MIT SCALE ProjectPhe Krste Asanovicrrstrst0V-7:09-1999 (SEPTEMBER):091399/CSP:22161XXX.pptRonny KrashinskyEMB3907 KrMicrosoft PowerPoint 4.0913@Pl5 @S^o@"ˮ@plxG ; @  NE&S &&#TNPP2OMiV & TNPP &&TNPP   S --- !S-----u#--ww@9 |ww w0- @"Arial Black9 ww w0- .2 K The Vector . . 2 - . .$2 +Thread Architecture  .--1#-- @"Arial Black9 ww w0- ."2 Ronny Krashinsky,     . .@"Arial Black: ww w0-.12 Chris Batten, Krste Asanovi        .-.@"Arial Black9 ww w0- .02 =Computer Architecture Group       . .<2 e#MIT Laboratory for Computer Science         .@1Courier New9 ww w0- .2 % ronny@mit.edu. 
.-2 www.cag.lcs.mit.edu/scale.@"Arial BlackW1 9ww w0- .E2 )Boston Area Architecture Workshop (BARC)           . ."2 &January 30th, 2003    .-- "System !w-&TNPP &՜.+,0    {CustomMIT LCS TimesArial Wingdings Arial Black Courier NewTimes New Roman 22161XXXThe Vector-Thread Architecture IntroductionInstruction ParallelismLoop ParallelismData ParallelismThread Parallelism$Vector-Thread Architecture OverviewVector ArchitectureVector MicroarchitectureVector-Thread Architecture Using VPs for Do-Across Loops"Vector-Thread Microarchitecture Do-Across ExecutionMicro-Threading VPs#Loop Parallelism and Architectures%Using the Vector-Thread ArchitectureSCALE-0 OverviewPowerPoint Presentation  Fonts UsedDesign Template Slide Titles(_ΚRonny KrashinskyRonny Krashinsky  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~      !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~      !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~Root EntrydO)Current UserSummaryInformation(PowerPoint Document(DocumentSummaryInformation8
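The slide text above names three modes of operation: pure data-parallel operation, parallel loop execution with do-across dependencies, and fine-grain multi-threading. The following is a minimal sketch of what loops of each kind look like, not code from the SCALE project. All function names (`do_all`, `do_across`, `collatz_steps`, `threaded`) are hypothetical and chosen only for illustration, and plain Python runs every loop serially, so the sketch shows the dependence structure of each loop rather than any actual parallel execution.

```python
def do_all(a, b):
    """DO-ALL loop: no iteration reads a value written by another,
    so all iterations could execute at once (data parallelism)."""
    return [x + y for x, y in zip(a, b)]


def do_across(a):
    """Do-across loop: part of each iteration is independent (the square),
    but the running total is passed from one iteration to the next."""
    out = []
    carry = 0
    for x in a:
        sq = x * x       # independent of every other iteration
        carry += sq      # loop-carried dependency on the previous iteration
        out.append(carry)
    return out


def collatz_steps(x):
    """Data-dependent control flow: the trip count depends on the input,
    so different inputs follow different instruction paths."""
    n = 0
    while x != 1:
        x = x // 2 if x % 2 == 0 else 3 * x + 1
        n += 1
    return n


def threaded(values):
    """Thread-style loop: iterations are independent, but each one runs
    its own control flow of a different length."""
    return [collatz_steps(v) for v in values]


if __name__ == "__main__":
    print(do_all([1, 2, 3], [10, 20, 30]))
    print(do_across([1, 2, 3]))
    print(threaded([1, 2, 6]))
```

In this sketch, `do_all` corresponds to a loop a vector unit could execute directly, `do_across` to a loop where only the hand-off of `carry` between neighbouring iterations must be ordered, and `threaded` to work where each iteration needs its own control flow.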