# EIE: Efficient Inference Engine on Compressed Deep Neural Network

## Deep Neural Network

- Convolutional layers
- Fully-connected layers
- FC layers contain the trained weights; this work focuses on inference only
- Multiply-Accumulate (MAC) operations on each layer
- DNN dataflows
- Convolutional layers: ~5% of memory, ~95% of FLOPs
- FC layers: ~5% of FLOPs, 90-95% of memory

## Motivation

- Inference metrics: throughput, latency, model size, energy use
- Uncompressed DNN does not fit into SRAM, so weights must be fetched from DRAM
- Von Neumann bottleneck
- Figure from Chen 2016
- Additional level of indirection because of indices (weight sharing/reuse)

## Compression

- In general: encode the weights so that fewer bits are needed per weight

Trivial:

- Use different (smaller) kernels/filters on the input
- Apply pooling to the inputs (reduces runtime memory)

More complex:

- Pruning (remove unimportant weights and retrain; 2 approaches)
- Encode with relative indexing
- Weight quantization with clustering (k-means sketch at the end of these notes)
  - Group similar weights into clusters
  - Minimize the within-cluster sum of squares (WCSS)
  - Different methods to initialize the cluster centroids, e.g. random, linear, CDF-based
  - Indirection because of the shared-weight table lookup
- Huffman encoding (binary tree built from symbol frequencies, applied globally; sketch at the end)
- Fixed-point quantization of activations (cf. CPU optimizations)
- Extremely narrow weight representation (4 bit)
- Compressed sparse column (CSC) matrix representation (sketch at the end)

## EIE implementation

- Per-activation formula: b_i = ReLU(Σ_j W_ij · a_j); with sparsity, only non-zero a_j and non-zero W_ij contribute
- Accelerates sparse and weight-sharing networks
- Uses the CSC representation
- Each PE quickly finds the non-zero elements in its slice of a column
- Explain the general procedure
- Show image of the architecture
- Non-zero filtering of the input activations (leading non-zero detection)
- Activation queues (FIFOs) for load balancing across PEs
- Two separate SRAM banks for the 16-bit pointers that mark column boundaries
- Each matrix entry: 8 bits wide (4-bit weight index into the shared table and 4-bit relative row index)
- Table lookup / weight decoding of the index in the same cycle
- Arithmetic unit: performs the multiply-accumulate
- Read/write unit
- Source and destination activation register files
  - Exchange their roles on each layer
- Feed-forward networks

## EIE evaluation

- Speedup: 189x, 13x, 307x faster than CPU, GPU and mGPU
- EIE is latency-focused: batch size of 1
- Throughput: 102 GOP/s on the compressed network, equivalent to ~3 TOP/s on the uncompressed network
- Energy efficiency: 24,000x, 3,400x, 2,700x more energy-efficient than CPU, GPU and mGPU
- Speed calculation: measure wall-clock times for different workloads
- Energy calculation: total computation time × average measured power
- Sources of energy consumption and reasons for the lower consumption:
  - SRAM access instead of DRAM
  - Compression scheme and architecture reduce the number of memory reads
  - Vector sparsity encoded in the CSC representation

## Limitations / future optimizations

- EIE is only capable of (sparse) matrix-vector multiplication
- Other optimization methods
- In-memory acceleration
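
## Code sketches (appendix)

To make the weight-sharing step concrete, here is a minimal sketch of k-means weight clustering with a 4-bit (16-entry) codebook and linear centroid initialization. The function name `cluster_weights` and the use of NumPy are my own choices, not taken from the paper or its code.

```python
# Minimal k-means weight sharing: cluster weights into a 16-entry codebook
# (4-bit indices) and minimize the within-cluster sum of squares (WCSS).
# Hypothetical helper, not taken from the paper's code.
import numpy as np

def cluster_weights(weights, n_clusters=16, n_iter=50):
    """Return (codebook, indices) so that codebook[indices] approximates weights."""
    w = weights.ravel()
    # Linear initialization: centroids evenly spaced between min and max weight.
    centroids = np.linspace(w.min(), w.max(), n_clusters)
    for _ in range(n_iter):
        # Assignment step: each weight goes to its nearest centroid.
        idx = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned weights.
        for k in range(n_clusters):
            if np.any(idx == k):
                centroids[k] = w[idx == k].mean()
    return centroids.astype(np.float32), idx.reshape(weights.shape).astype(np.uint8)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(64, 64)).astype(np.float32)
    codebook, idx = cluster_weights(W)
    W_decoded = codebook[idx]            # shared-weight table lookup
    print("max abs quantization error:", np.abs(W - W_decoded).max())
```

After clustering, only the 4-bit indices and the small codebook are stored; decoding is a table lookup, which is the indirection mentioned above.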
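
Huffman encoding of the quantized weight indices can be sketched with Python's `heapq`. This is a generic Huffman coder over a toy index stream, only meant to illustrate the idea; in the paper the coding is applied offline to the stored model, not inside the accelerator.

```python
# Huffman coding of quantized weight indices: frequent symbols get short codes.
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix code (symbol -> bit string) from a sequence of symbols."""
    freq = Counter(symbols)
    # Heap entries: (frequency, tiebreaker, tree); a tree is a symbol or a pair.
    heap = [(f, i, s) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                         # degenerate: only one distinct symbol
        return {heap[0][2]: "0"}
    counter = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)      # merge the two rarest subtrees
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (left, right)))
        counter += 1
    code = {}
    def walk(node, prefix):
        if isinstance(node, tuple):            # internal node: recurse left/right
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                  # leaf: assign the accumulated bits
            code[node] = prefix
    walk(heap[0][2], "")
    return code

if __name__ == "__main__":
    indices = [0, 0, 0, 3, 3, 7, 1, 0, 0, 3]   # toy stream of 4-bit weight indices
    code = huffman_code(indices)
    encoded = "".join(code[s] for s in indices)
    print(code, len(encoded), "bits vs", 4 * len(indices), "bits fixed-width")
```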
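
The next sketch combines the CSC storage with 4-bit relative row indexing and the per-activation multiply-accumulate loop b = ReLU(W a). It is a single-PE simplification of the scheme (EIE interleaves rows across many PEs); the function names and the convention that codebook index 0 is reserved for the zero weight used by filler entries are my own assumptions.

```python
# CSC storage with 4-bit relative row indexing plus the per-activation
# multiply-accumulate loop b = ReLU(W a). Single-PE simplification; names and
# the reserved codebook index 0 (= weight 0.0 for filler entries) are assumptions.
import numpy as np

MAX_GAP = 15  # largest zero-run a 4-bit relative index can express

def encode_csc(weight_idx):
    """Encode a matrix of codebook indices (0 = pruned weight) column by column.
    Returns (entries, col_ptr): entries[k] = (gap, idx), where 'gap' zeros precede
    the entry inside its column and 'idx' points into the shared-weight codebook;
    col_ptr[j]..col_ptr[j+1] delimits column j."""
    n_rows, n_cols = weight_idx.shape
    entries, col_ptr = [], [0]
    for j in range(n_cols):
        gap = 0
        for i in range(n_rows):
            if weight_idx[i, j] == 0:
                gap += 1
                continue
            while gap > MAX_GAP:              # zero-run too long for 4 bits:
                entries.append((MAX_GAP, 0))  # insert a filler zero entry
                gap -= MAX_GAP + 1            # filler also occupies one row slot
            entries.append((gap, int(weight_idx[i, j])))
            gap = 0
        col_ptr.append(len(entries))
    return entries, col_ptr

def sparse_matvec_relu(entries, col_ptr, codebook, a, n_rows):
    """Per-activation formula: scan only non-zero activations a_j, and for each
    one only the stored (non-zero) entries of column j of W."""
    b = np.zeros(n_rows, dtype=np.float32)
    for j, aj in enumerate(a):
        if aj == 0:                          # activation sparsity: skip zeros
            continue
        row = 0
        for gap, idx in entries[col_ptr[j]:col_ptr[j + 1]]:
            row += gap                       # rebuild the absolute row index
            b[row] += codebook[idx] * aj     # shared-weight lookup + MAC
            row += 1
    return np.maximum(b, 0.0)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    codebook = np.concatenate(([0.0], rng.normal(size=15))).astype(np.float32)
    # ~90% pruned matrix of codebook indices, sparse input activations.
    W_idx = rng.integers(1, 16, size=(32, 32)) * (rng.random((32, 32)) < 0.1)
    a = (rng.random(32) < 0.3) * rng.normal(size=32)
    entries, col_ptr = encode_csc(W_idx)
    b = sparse_matvec_relu(entries, col_ptr, codebook, a, n_rows=32)
    b_ref = np.maximum(codebook[W_idx] @ a, 0.0)          # dense reference
    print("max abs error vs dense:", np.abs(b - b_ref).max())
```

The two skip conditions in `sparse_matvec_relu` (zero activations, stored non-zeros only) are what lets work scale with the number of non-zeros rather than with the dense matrix size.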