
Machine Learning Training

Q-Memory revolutionizes machine learning by providing ultra-high-density weight storage and in-memory analog computation.

Modern neural networks require massive parameter storage:

  • GPT-3: 175 billion parameters = 700 GB (FP32)
  • LLaMA-70B: 70 billion parameters = 280 GB
  • Stable Diffusion: 890M parameters = 3.6 GB
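These sizes follow directly from parameter count × bytes per parameter. A minimal sketch of that arithmetic, assuming 4 bytes (FP32) per weight for all three entries, which is what the listed figures imply:

import math

# Raw weight storage = parameters × bytes per parameter (FP32 assumed here).
MODELS = {
    "GPT-3":            175e9,
    "LLaMA-70B":        70e9,
    "Stable Diffusion": 890e6,
}

for name, params in MODELS.items():
    gb = params * 4 / 1e9          # decimal GB, matching the list above
    print(f"{name}: {gb:.1f} GB")  # 700.0, 280.0, 3.6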
Q-Memory supports three weight-storage schemes.

Scheme A: FP16 Split Storage

FP16 weight: 16 bits (1 sign, 5 exponent, 10 mantissa)

Q-Memory storage:

  • Cell 1: sign + exponent (6 bits) → uses 6 of 13 bits
  • Cell 2: mantissa (10 bits) → uses 10 of 13 bits

Efficiency: 16 bits / 2 cells = 8 bits per cell
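As an illustration of Scheme A, the sketch below splits an FP16 bit pattern into a sign+exponent field and a mantissa field, each narrow enough for one 13-bit (8,192-level) cell. The helper names are hypothetical and the cell-programming interface itself is not shown.

import numpy as np

def split_fp16(weight: float) -> tuple[int, int]:
    """Split an FP16 weight into (sign+exponent, mantissa) fields (Scheme A)."""
    bits = int(np.array(weight, dtype=np.float16).view(np.uint16))  # raw 16-bit pattern
    sign_exp = bits >> 10        # top 6 bits: 1 sign + 5 exponent
    mantissa = bits & 0x3FF      # low 10 bits: mantissa
    return sign_exp, mantissa

def merge_fp16(sign_exp: int, mantissa: int) -> float:
    """Reassemble the FP16 weight from the two cell values."""
    bits = np.array((sign_exp << 10) | mantissa, dtype=np.uint16)
    return float(bits.view(np.float16))

w = 0.3141
cells = split_fp16(w)
assert merge_fp16(*cells) == float(np.float16(w))  # lossless round trip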
Scheme B: 8-Bit Quantized Storage

Quantize weights to 8-bit integers: w ∈ [-128, 127]

Q-Memory storage:

  • A single cell stores one 8-bit weight using 256 of its 8,192 levels
  • The remaining capacity allows up to 32 weights per cell

Efficiency: 256 bits per cell (with compression)
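A minimal sketch of the per-weight encoding Scheme B assumes: symmetric 8-bit quantization into [-128, 127]. The multi-weight packing and compression steps are not shown, and the scale-per-tensor choice below is an assumption for illustration.

import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric 8-bit quantization: map FP weights to integers in [-128, 127]."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
    # Each int8 value occupies one of 256 levels (q + 128) out of a cell's 8,192.
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(768).astype(np.float32)
q, s = quantize_int8(w)
print("max quantization error:", np.max(np.abs(dequantize_int8(q, s) - w)))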
Scheme C: Analog Weight Storage

Store each weight directly as an analog resistance value.

Advantages:

  • No quantization loss
  • Native analog multiply-accumulate
  • One weight per cell, but with analog computation
Model     | Weights | DRAM   | Flash  | Q-Memory (Scheme C) | Reduction
ResNet-50 | 25M     | 100 MB | 25 MB  | 2 MB                | 50×
BERT-Base | 110M    | 440 MB | 110 MB | 8.5 MB              | 52×
GPT-3     | 175B    | 700 GB | 175 GB | 13.5 GB             | 52×
LLaMA-70B | 70B     | 280 GB | 70 GB  | 5.4 GB              | 52×

Key Innovation: Compute directly in the Q-Memory array without moving data to the ALU.

Neural network layers perform: y = Wx

Classical Approach:

  1. Read W from memory → 100 ns × M×N elements
  2. Compute in ALU → 1 ns × M×N operations
  3. Total: ~100 ns × M×N (memory-bound!)

Q-Memory Analog Approach:

  1. Encode x as voltages on bitlines
  2. Each Q-Memory cell multiplies (voltage × conductance)
  3. Sum currents on wordlines → y
  4. Total: 10 ns (constant, regardless of M or N!)
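A small numerical sketch of the analog dot-product idea: if inputs are encoded as bitline voltages and weights as cell conductances, each cell contributes a current V × G (Ohm's law) and each wordline sums those currents (Kirchhoff's current law), so the collected currents equal the matrix-vector product. The array size and the voltage/conductance scale factors below are illustrative, not hardware parameters.

import numpy as np

rng = np.random.default_rng(0)

# Weight matrix W stored as one conductance per cell (illustrative scaling).
M, N = 1024, 1024
W = rng.standard_normal((M, N)).astype(np.float32)
G = W * 1e-6                      # conductances in siemens (arbitrary scale)

# Input vector x encoded as bitline voltages.
x = rng.standard_normal(N).astype(np.float32)
V = x * 0.1                       # volts (arbitrary scale)

# Each cell contributes I = V * G; each wordline sums its cells' currents.
I = G @ V                         # shape (M,): one summed current per wordline

# Undo the encoding scales to recover y = W @ x.
y = I / (1e-6 * 0.1)
assert np.allclose(y, W @ x, rtol=1e-4)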
Operation   | GPU (others) | CPU (Xeon) | DRAM   | Q-Memory Analog | Speedup
1K×1K MVM   | 100 ns       | 1 μs       | 500 ns | 10 ns           | 10-100×
Energy (nJ) | 1000         | 5000       | 500    | 50              | 10-100×
4K×4K MVM   | 400 ns       | 16 μs      | 8 μs   | 40 ns           | 10-400×
Energy (μJ) | 16           | 80         | 8      | 0.8             | 10-100×

BERT Layer Specifications:

  • Input: 768 dimensions
  • Hidden: 3072 dimensions
  • Weight matrix: 768 × 3072 = 2.36M parameters

Classical GPU (Others):

  • Read weights: 9.44 MB
  • Memory bandwidth: 2 TB/s
  • Time to read: 4.7 μs
  • Compute time: 0.015 μs
  • Total: 4.7 μs (memory-bound)

Q-Memory Analog:

  • Weights pre-stored in crossbar
  • Parallel MVM: 10 ns (all outputs at once!)
  • Speedup: 470× over GPU!
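The timings in this example follow from simple arithmetic, reproduced here as a sanity check (the 10 ns analog MVM latency is the figure claimed above for the Q-Memory crossbar):

# BERT feed-forward layer: 768 x 3072 FP32 weights.
params = 768 * 3072                       # ~2.36M parameters
weight_bytes = params * 4                 # ~9.44 MB in FP32

bandwidth = 2e12                          # 2 TB/s GPU memory bandwidth
gpu_read_time = weight_bytes / bandwidth  # ~4.7e-6 s, dominates the GPU total

qmem_mvm_time = 10e-9                     # 10 ns analog MVM (weights pre-stored)
print(f"GPU (memory-bound): {gpu_read_time * 1e6:.2f} us")
print(f"Q-Memory analog:    {qmem_mvm_time * 1e9:.0f} ns")
print(f"Speedup:            {gpu_read_time / qmem_mvm_time:.0f}x")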

Training requires gradient computation and weight updates.

Forward pass:  y = Wx
Backward pass: δ = Wᵀε (W transpose × error)

Q-Memory implementation:

  1. Store W in the crossbar (forward direction)
  2. Transpose by swapping rows and columns (no data movement!)
  3. Apply error ε to the wordlines → δ appears on the bitlines

Benefits:

  • No weight reload
  • Same 10 ns latency for the backward pass
  • Zero energy for the transpose operation
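Numerically, the point is that the same stored matrix serves both passes: driving the array from one side computes Wx, driving it from the other computes Wᵀε, with no copy or re-layout of the weights. A tiny numpy sketch of the two computations the crossbar performs (dimensions borrowed from the BERT example above):

import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((3072, 768))   # one stored weight matrix

x = rng.standard_normal(768)           # forward input (applied to the bitlines)
err = rng.standard_normal(3072)        # backward error (applied to the wordlines)

y = W @ x          # forward pass:  y = W x
delta = W.T @ err  # backward pass: delta = W^T err, same stored W, no transpose copy

assert y.shape == (3072,) and delta.shape == (768,)

The full training loop below puts these pieces together.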
def train_epoch(model, data, qmem_array, lr=0.01):
    for batch in data:
        # Forward pass (computed inside the Q-Memory crossbars)
        activations = []
        x = batch.input
        for layer in model.layers:
            y = qmem_array.forward_MVM(x, layer.id)            # analog MVM, ~10 ns
            activations.append(y)
            x = activation_function(y)                         # nonlinearity, ~1 ns

        # Compute loss on the final output
        loss = compute_loss(x, batch.target)                   # ~100 ns

        # Backward pass (also computed inside the Q-Memory crossbars)
        error = loss_gradient(loss)
        for layer, y in zip(reversed(model.layers), reversed(activations)):
            delta = qmem_array.backward_MVM(error, layer.id)   # transpose MVM, ~10 ns
            # Update weights in place inside the array
            gradient = outer_product(y, error)                 # ~50 ns
            qmem_array.update_weights(layer.id, gradient, lr)  # ~200 ns
            error = delta
    return model

# Total time per layer: ~220 ns (vs. 5+ μs on a GPU)
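For concreteness, here is a toy, purely software stand-in for the array interface the pseudocode above assumes. The method names forward_MVM, backward_MVM, and update_weights come from the pseudocode; the ToyQMemArray class itself and its weight initialization are hypothetical and only emulate the crossbar numerically.

import numpy as np

class ToyQMemArray:
    """Software stand-in for a Q-Memory crossbar stack (illustrative only)."""

    def __init__(self, layer_shapes):
        # One weight matrix per layer id; in hardware these are cell conductances.
        self.W = {i: np.random.randn(*shape) * 0.01
                  for i, shape in enumerate(layer_shapes)}

    def forward_MVM(self, x, layer_id):
        # Hardware: drive bitlines with x, read summed wordline currents -> y = W x
        return self.W[layer_id] @ x

    def backward_MVM(self, error, layer_id):
        # Hardware: drive the opposite terminals -> delta = W^T error
        return self.W[layer_id].T @ error

    def update_weights(self, layer_id, gradient, lr):
        # Hardware: in-place conductance update; here, a plain SGD step
        self.W[layer_id] -= lr * gradient

A real deployment would also have to model conductance quantization, read noise, and update asymmetry; none of that is captured in this sketch.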
Model     | Layers | GPU Time/Epoch | Q-Memory Time/Epoch | Speedup
ResNet-50 | 50     | 100 seconds    | 2 seconds           | 50×
BERT-Base | 12     | 300 seconds    | 0.66 seconds        | 450×
GPT-2     | 48     | 1200 seconds   | 2.64 seconds        | 450×
Operation       | GPU (Joules) | CPU (Joules) | Q-Memory (Joules) | Reduction
ResNet-50 Epoch | 1000         | 5000         | 10                | 100-500×
BERT Epoch      | 3000         | 15000        | 6.6               | 450-2300×

Network: MNIST (784-256-128-10), Batch size: 64

Classical GPU (others):

  • Forward pass: 50 μs
  • Backward pass: 50 μs
  • Weight update: 10 μs
  • Total per batch: 110 μs
  • Throughput: 581,000 images/sec

Q-Memory Accelerator:

  • Layers 0-2 forward: 1920 ns (pipelined)
  • Activation: 192 ns
  • Backward: 1920 ns
  • Weight update: 600 ns
  • Total per batch: 4.23 μs
  • Throughput: 15.1 million images/sec

Speedup: 26× over the GPU!
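The throughput and speedup figures follow from the per-batch latencies above; a quick check using those totals (the 4.23 μs figure reflects the pipelining overlap noted above rather than a straight sum of the stage latencies):

batch_size = 64

gpu_batch_time = 110e-6    # seconds per batch (forward + backward + update)
qmem_batch_time = 4.23e-6  # seconds per batch, with pipelined stages

gpu_throughput = batch_size / gpu_batch_time    # ~581,000 images/sec
qmem_throughput = batch_size / qmem_batch_time  # ~15.1 million images/sec

print(f"GPU:      {gpu_throughput:,.0f} images/sec")
print(f"Q-Memory: {qmem_throughput:,.0f} images/sec")
print(f"Speedup:  {gpu_batch_time / qmem_batch_time:.0f}x")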

  • 52× storage reduction for neural network weights
  • 10-100× faster matrix-vector multiplication
  • 20-500× energy reduction for training
  • In-memory computation eliminates data movement
  • Scalable to billion-parameter models