# Interleaver Design for Deep Neural Networks

Sourya Dey, Peter Beerel, Keith Chugg Asilomar Conference on Signals, Systems, and Computers Oct-Nov 2017



### **Overview of Current DNNs**

Key machine learning technologies

- Lot of parameters Memory intensive
- Slow to train Computationally intensive
- Training done offline in CPU/GPU
- Custom hardware used for inference only

2

### Typical Supervised Network



Sourya Dey, USC

neural networks. In: NIPS-2012, pp. 1097-1105 (2012) Zhang, C., Wu, D., Sun, J., Sun, G., Luo, G., Cong, J.: Energy-efficient CNN implementation on a deeply pipelined FPGA cluster. In: ISLPED-2016. pp. 326- 331. ACM, New York (2016)

3



### **Overview of our Research**

- Predefined sparsity Memory friendly
  - 2-3x savings on CL only network parameters
  - > 2 orders of magnitude savings on CL parameters of CNNs \*

- Edge-based processing Computationally flexible
- Hardware optimizations Fast training

#### FPGA based architecture - Online training and inference

Dey, S., Shao, Y., Chugg, K.M., Beerel, P.A.: Accelerating Training of Deep Neural Networks via Sparse Edge Processing. In: Proc. ICANN-2017, pp. 273-280. LNCS (2017) \* Dey, S., Huang, K.W., Beerel, P.A., Chugg, K.M.: Characterizing Sparse Connectivity Patterns in Neural Networks. In: ICLR-2018 (submitted for publication)

5

Sourya Dey, USC



Fully connected (FC) network



Fully connected (FC) network Fanout (*fo*) = 4

Sourya Dey, USC



Fully connected (FC) network Fanout (*fo*) = 4 Fanin (*fi*) = 8



Fully connected (FC) network Fanout (*fo*) = 4 Fanin (*fi*) = 8 Connectivity = 100%



Fully connected (FC) network Fanout (*fo*) = 4 Fanin (*fi*) = 8 Connectivity = 100%

Sparse network fo = 1, fi = 2 Connectivity = 25%

### **Sparsity** - Predefined



Fully connected (FC) network Fanout (*fo*) = 4 Fanin (*fi*) = 8 Connectivity = 100% Sparse network fo = 1, fi = 2Connectivity = 25%

Sourya Dey, USC

### **Example of Parameter Savings**



7

### Present Work -Interleavers for Sparse Patterns



### Present Work -Interleavers for Sparse Patterns



### Present Work -Interleavers for Sparse Patterns



Interleaver algorithm ensures:

- Each output connected to a good spatial chunk of different inputs
- No neuron unconnected

### Interleaver Requirements

- Optimized for computational efficiency in hardware
- Optimized for on-chip storage
- High values for metrics which are performance indicators

9

### Degree of Parallelism = z

- z memories for all parameters of same type
- Process z parameters in 1 cycle => 1 from each mem
- Process all parameters in a sweep

| Mem<br>0        | Mem<br>1       |                |  | Mem<br>z-1       |
|-----------------|----------------|----------------|--|------------------|
| W <sub>0</sub>  | W <sub>1</sub> | w <sub>2</sub> |  | W <sub>z-1</sub> |
| Wz              |                |                |  |                  |
| W <sub>2z</sub> |                |                |  |                  |
| W <sub>3z</sub> |                |                |  | W <sub>W-1</sub> |



### Clash Freedom

- E.g. *p* activations, so depth of each memory = p/z
- Accessed in interleaved (permuted order)



Interleaver must prevent clashes when accessing activations







Cycle 2









- Let r be a random permutation of memory row index => Size p/z  $\pi_W(i) = \left(t[i\%p] \times z + i\%z\right) \times fo + \lfloor i/p \rfloor$
- Replicate or partition r to form s of z elements => Starting indices of all mems
- t = {s, s+1, ..., s+p/ z-1}%(p/z) => All p indices for all mems in order

Example: 
$$p=32$$
,  $fo=2$ ,  $z=8 \Rightarrow i \in [0,63]$ . Say  $i = 45$ 

Let r be a random permutation of memory row index => Size p/z

$$\pi_W(i) = \left(t\left[i\%p\right] \times z + i\%z\right) \times fo + \lfloor i/p$$

- Replicate or partition r to form s of z elements => Starting indices of all mems
- t = {s, s+1, ..., s+p/ z-1}%(p/z) => All p indices for all mems in order

- Let r be a random permutation of memory row index => Size p/z
- Replicate or partition r to form s of z elements => Starting indices of all mems
- t = {s, s+1, ..., s+p/ z-1}%(p/z) => All p indices for all mems in order

Example: 
$$p=32$$
,  $fo=2$ ,  $z=8 \Rightarrow i \in [0,63]$ . Say  $i = 45$   
$$\pi_W(i) = \left(t [i\%p] \times z + i\%z\right) \times fo + \lfloor i/p \rfloor$$

Activation Memory Bank Row = 45%32 = 1

| <b>Row1</b> $a_8$ $a_9$ $a_{10}$ $a_{11}$ $a_{12}$ $a_{13}$ $a_{14}$ | a <sub>15</sub> |
|----------------------------------------------------------------------|-----------------|
|----------------------------------------------------------------------|-----------------|

- Let r be a random permutation of memory row index => Size p/z
- Replicate or partition r to form s of z elements => Starting indices of all mems
- t = {s, s+1, ..., s+p/ z-1}%(p/z) => All p indices for all mems in order

Example: 
$$p=32$$
,  $fo=2$ ,  $z=8 \Rightarrow i \in [0,63]$ . Say  $i = 45$   
 $\pi_W(i) = \left(t[i\%p] \times z + i\%z\right) \times fo + \lfloor i/p$   
Activation Memory Bank Row = 45%32 = 1  
Row1 a<sub>8</sub> a<sub>9</sub> a<sub>10</sub> a<sub>11</sub> a<sub>12</sub> a<sub>13</sub> a<sub>14</sub> a<sub>15</sub>  
a<sub>13</sub> Left side Neuron = 1x8+45%8 = 13

- Let r be a random permutation of memory row index => Size p/z
- Replicate or partition r to form s of z elements => Starting indices of all mems
- t = {s, s+1, ..., s+p/ z-1}%(p/z) => All p indices for all mems in order

Example: 
$$p=32$$
,  $fo=2$ ,  $z=8 \Rightarrow i \in [0,63]$ . Say  $i = 45$   

$$\pi_{W}(i) = \left(t\left[i\% p\right] \times z + i\% z\right) \times fo + \lfloor i/p$$
Activation Memory Bank Row = 45%32 = 1  
Row1 a<sub>8</sub> a<sub>9</sub> a<sub>10</sub> a<sub>11</sub> a<sub>12</sub> a<sub>13</sub> a<sub>14</sub> a<sub>15</sub>  
a<sub>13</sub> Left side Neuron = 1x8+45%8 = 13  
Left side Neuron's Weight = 13x2 = 26  
a<sub>13</sub>  
 $\psi_{zz}$ 

- Let r be a random permutation of π memory row index => Size p/z
- Replicate or partition r to form s of z elements => Starting indices of all mems
- t = {s, s+1, ..., s+p/ z-1}%(p/z) => All p indices for all mems in order

Example: 
$$p=32$$
,  $fo=2$ ,  $z=8 \Rightarrow i \in [0,63]$ . Say  $i = 45$   

$$\pi_{W}(i) = \left(t\left[i\% p\right] \times z + i\% z\right) \times fo + \lfloor i/p \rfloor$$
Activation Memory Bank Row = 45%32 = 1  
 $\boxed{\text{Row1} \quad a_{9} \quad a_{10} \quad a_{11} \quad a_{12} \quad a_{33} \quad a_{14} \quad a_{15}}$ 

$$a_{13} \text{ Left side Neuron = 1x8+45\%8 = 13}$$

$$\underbrace{\text{How 1} \quad \text{Left side Neuron's Weight = 13x2 = 26}}_{Weight Offset = 1}$$

- Let r be a random permutation of memory row index => Size p/z
- Replicate or partition r to form s of z elements => Starting indices of all mems
- t = {s, s+1, ..., s+p/ z-1}%(p/z) => All p indices for all mems in order

### **Meeting Requirements**

- Easily generated Proof in paper
  - All variables involved are powers of 2 (add extra neurons)
    - Modulo = Bit select
    - Multiply = Bit shift
  - Only store r for a new pattern
    - Create t by accumulating 1s
- Clash freedom Proof in paper

### Variations

Start Vector Shuffle (SV)

Original: *s* = {2,0,3,1,2,0,3,1} After SV: *s* = {2,0,3,1,**3,0,1,2**}

Sweep Starter Shuffle (SS)

Original: 1<sup>st</sup> sweep *s* = {2,0,3,1,2,0,3,1} 2<sup>nd</sup> sweep *s* = {2,0,3,1,2,0,3,1} After SS: 1<sup>st</sup> sweep *s* = {2,0,3,1,3,0,1,2} 2<sup>nd</sup> sweep *s* = {**0,3,2,1,0,3,2,1**}

Memory Dither (MD)

$$\pi_{W}(i) = \left(t\left[i\% p\right] \times z + \boldsymbol{v}[i\% z]\right) \times fo + \lfloor i/p \rfloor$$
  
v[.] = Permutation of [0,z-1]

### Some Weight Interleaver Patterns



### Spread and Dispersion

- Spread: Connections that are close on 1 side should be far away on other
- Dispersion: Connections should be irregular, i.e. no patterns or trends

| Interleaver Variant | Spread | Dispersion |
|---------------------|--------|------------|
| Basic               | 18.28  | 0.04       |
| MD                  | 7.48   | 0.22       |
| SS                  | 9.7    | 0.07       |
| SS + MD             | 6.5    | 0.37       |
| SV                  | 6.6    | 0.08       |
| SV + MD             | 7.31   | 0.23       |
| SV + SS             | 5.05   | 0.09       |
| SV + SS + MD        | 5.7    | 0.39       |

### **Dataset Results**

![](_page_33_Figure_1.jpeg)

\* Sourya Dey: https://cobaltfolly.wordpress.com/2017/10/15/morse-code-dataset-for-artificial-neural-networks/

### Morse Dataset Trends

![](_page_34_Figure_1.jpeg)

Morse has fewer inputs and low redundancy Spread should be high, dispersion hurts

### Summary and Ongoing Work

- Pre-defined sparse hardware architecture to lower memory and computational footprint
- Interleaver algorithm to guarantee clash freedom and ease of storage
- Interleaver variations and effects on performance

- Extension to multiple junctions Adjacency matrices
- Measures to characterize network performance

Dey, S., Huang, K.W., Beerel, P.A., Chugg, K.M.: Characterizing Sparse Connectivity Patterns in Neural Networks. In: ICLR-2018 (submitted for publication)

## Thank you!

### Questions?

Contact: souryade@usc.edu