Neural Network Architectures

What Are Neural Networks?

Neural networks are a class of machine learning models inspired by biological neurons. They consist of interconnected computational nodes (neurons) that learn patterns in data by adjusting the weights between connections. Since Rosenblatt's Perceptron in 1958, neural networks have gone through multiple waves: backpropagation in the 1980s, AlexNet sparking the deep learning revolution in 2012, the Transformer architecture revolutionizing NLP in 2017, and diffusion models breaking through in generative AI in the 2020s.

Different tasks demand different architectures: CNNs for image recognition, RNNs/LSTMs for sequential data, Transformers for language and multimodal tasks, and GANs or diffusion models for generation. This guide systematically covers the principles, core components, code implementations, and best use cases for each architecture.

Feedforward Neural Network (FNN)

The Feedforward Neural Network is the most fundamental neural network architecture. Data flows in one direction from input layer through one or more hidden layers to the output layer, with no cyclic connections. Each layer's neurons are fully connected to the next layer, with activation functions (e.g., ReLU, Sigmoid) introducing non-linearity.

FNNs suit simple classification and regression tasks on structured data, such as tabular prediction and credit scoring. When the data has spatial structure (images) or temporal structure (sequences), a CNN or RNN is the better choice.

Key Pipeline

Input → Hidden Layer(s) → Activation (ReLU) → Output

PyTorch

import torch
import torch.nn as nn

class FNN(nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )
    def forward(self, x):
        return self.net(x)

model = FNN(784, 256, 10)  # e.g. MNIST digits
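
The model alone does nothing without an optimization loop. Below is a minimal training step for the FNN above; it is a sketch in which random tensors stand in for a real data loader, and the optimizer/loss choices (Adam, cross-entropy) are common defaults rather than requirements:

```python
import torch
import torch.nn as nn

# Same FNN as above, repeated so this snippet runs standalone
class FNN(nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )
    def forward(self, x):
        return self.net(x)

model = FNN(784, 256, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

x = torch.randn(32, 784)          # stand-in batch of 32 flattened images
y = torch.randint(0, 10, (32,))   # stand-in integer class labels

optimizer.zero_grad()
loss = criterion(model(x), y)     # forward pass + loss
loss.backward()                   # backpropagation
optimizer.step()                  # gradient descent weight update
```

In practice `x` and `y` come from a `DataLoader` and this step repeats for many epochs.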

Use cases: Tabular Classification, Regression, Credit Scoring, Recommender Embeddings

Convolutional Neural Network (CNN)

CNNs use convolutional filters (kernels) that slide across input data to automatically extract local features like edges, textures, and shapes. Convolutional layers use weight sharing to drastically reduce parameter count; pooling layers reduce spatial dimensions and enhance translation invariance. By stacking multiple convolutional layers, the network progressively abstracts from low-level features (edges) to high-level semantic features (object parts, full objects).

CNNs are the foundational architecture for computer vision, widely used in image classification, object detection, and semantic segmentation. While Vision Transformers (ViT) have surpassed CNNs on some tasks in recent years, CNNs remain important due to their efficiency and mature tooling.

Core Component Pipeline

Conv2d → BatchNorm → ReLU → MaxPool → Conv2d → ... → Flatten → FC (Linear)

Famous CNN Models

Model | Year | Key Innovation | Layers | Params
LeNet-5 | 1998 | First practical CNN, handwriting recognition | 5 | 60K
AlexNet | 2012 | ReLU + Dropout + GPU training, sparked DL revolution | 8 | 60M
VGG-16 | 2014 | Uniform 3x3 small kernels, deeper networks | 16 | 138M
ResNet-50 | 2015 | Residual connections (skip connections), solved vanishing gradient | 50 | 25M
EfficientNet | 2019 | Compound scaling (depth/width/resolution) | ~82 | 5-66M

PyTorch

import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 256),  # assumes 32x32 input: two 2x2 pools give 8x8
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )
    def forward(self, x):
        return self.classifier(self.features(x))
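
The `64 * 8 * 8` in the classifier assumes 32x32 inputs (e.g. CIFAR-10): a 3x3 convolution with `padding=1` preserves spatial size, and each `MaxPool2d(2)` halves it (32 → 16 → 8). A quick shape check of that arithmetic:

```python
import torch
import torch.nn as nn

# The feature extractor from SimpleCNN above, rebuilt standalone
features = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
)
x = torch.randn(1, 3, 32, 32)   # one 32x32 RGB image
print(features(x).shape)        # torch.Size([1, 64, 8, 8])
```

For other input resolutions, the first `Linear` layer's input size must change accordingly (or be replaced by `nn.AdaptiveAvgPool2d`).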

Use cases: Image Classification, Object Detection (YOLO), Semantic Segmentation, Face Recognition, Medical Imaging

RNN / LSTM / GRU

Recurrent Neural Networks (RNN) are designed for sequential data. The hidden state h(t) at each time step depends on both the current input x(t) and the previous hidden state h(t-1), capturing temporal dependencies in sequences. However, standard RNNs suffer from severe vanishing/exploding gradient problems on long sequences.

LSTM (Long Short-Term Memory) introduces three gating mechanisms (forget gate, input gate, output gate) and an independent cell state, effectively solving the long-range dependency problem. GRU (Gated Recurrent Unit) is a simplified variant that merges the forget and input gates into a single "update gate," requiring fewer parameters and training faster, while achieving comparable performance on many tasks.

RNN vs LSTM vs GRU Comparison

Feature | RNN | LSTM | GRU
Gating | None | 3 gates (forget/input/output) | 2 gates (reset/update)
Long-range Deps | Poor (vanishing gradient) | Excellent | Good
Parameters | Fewest | Most (4x hidden) | Medium (3x hidden)
Training Speed | Fast | Slow | Medium
Best For | Short sequences | Long text, speech | Medium sequences, limited resources
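
The 4x / 3x parameter ratios above can be verified directly: an LSTM holds four gated weight sets (input/forget/cell/output) and a GRU three (reset/update/candidate) against a plain RNN's one. A sketch with arbitrary sizes:

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

E, H = 64, 128  # embedding and hidden sizes (arbitrary)
rnn  = nn.RNN(E, H, batch_first=True)
lstm = nn.LSTM(E, H, batch_first=True)
gru  = nn.GRU(E, H, batch_first=True)

base = n_params(rnn)          # one weight/bias set
print(n_params(lstm) / base)  # 4.0 -- four gated sets
print(n_params(gru) / base)   # 3.0 -- three gated sets
```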

PyTorch

import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden, num_classes):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=2,
                            batch_first=True, dropout=0.3)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):
        x = self.emb(x)                    # (B, T, E)
        _, (h_n, _) = self.lstm(x)          # h_n: (2, B, H)
        out = self.fc(h_n[-1])              # last layer hidden
        return out

Use cases: Time Series Forecasting, Text Generation (pre-Transformer), Speech Recognition, Machine Translation (Seq2Seq), Sentiment Analysis

Transformer

The Transformer was proposed by Google in the 2017 paper "Attention Is All You Need." It completely eliminates recurrence, relying solely on the self-attention mechanism to model dependencies between any positions in a sequence. This enables highly parallel training, solving the sequential bottleneck of RNNs.

At its core is the Multi-Head Attention mechanism: input is projected into Query, Key, and Value matrices, and scaled dot-product attention computes how much each position should attend to every other position. Positional Encoding injects sequence order information, since self-attention itself is permutation-invariant.

Core Formula

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

Architecture Components

Input Embedding → Positional Encoding → Multi-Head Attention → Add & Norm → Feed-Forward → Add & Norm → Output
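
Because self-attention is permutation-invariant, the original paper injects order with fixed sinusoidal encodings added to the input embeddings. A minimal sketch of that scheme (learned positional embeddings are a common alternative):

```python
import math
import torch

def sinusoidal_pe(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same)."""
    pos = torch.arange(seq_len).unsqueeze(1).float()    # (T, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))   # (d/2,) frequencies
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe  # broadcast-added to (B, T, d) embeddings

pe = sinusoidal_pe(seq_len=128, d_model=512)
print(pe.shape)  # torch.Size([128, 512])
```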

Transformer Variants

Variant | Structure | Models | Primary Task | Params
Encoder-only | Encoder only, bidirectional attention | BERT, RoBERTa, DeBERTa | Understanding (classification, NER, QA) | 110M-340M
Decoder-only | Decoder only, causal (unidirectional) attention | GPT-4, Claude, LLaMA, Gemini | Generation (chat, code, reasoning) | 7B-1.8T
Encoder-Decoder | Full enc-dec, cross attention | T5, BART, mBART | Translation, summarization, Seq2Seq | 220M-11B
Vision Transformer | Image patches + Transformer encoder | ViT, DeiT, Swin Transformer | Image classification, detection | 86M-632M

PyTorch -- Self-Attention

import torch
import torch.nn as nn
import math

class SelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, C = x.shape
        q = self.W_q(x).view(B, T, self.n_heads, self.d_k).transpose(1,2)
        k = self.W_k(x).view(B, T, self.n_heads, self.d_k).transpose(1,2)
        v = self.W_v(x).view(B, T, self.n_heads, self.d_k).transpose(1,2)

        # Scaled dot-product attention
        attn = (q @ k.transpose(-2,-1)) / math.sqrt(self.d_k)
        attn = torch.softmax(attn, dim=-1)
        out = (attn @ v).transpose(1,2).contiguous().view(B, T, C)
        return self.W_o(out)

# Usage: attn = SelfAttention(d_model=512, n_heads=8)
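
The attention block above is bidirectional. Decoder-only models (the GPT family in the variants table) additionally apply a causal mask so position t cannot attend to positions after t. A minimal sketch of that masking step on attention logits of the same (B, heads, T, T) shape:

```python
import torch

T = 5
scores = torch.randn(1, 1, T, T)  # (B, heads, T, T) raw attention logits

# Upper-triangular True entries mark "future" positions to hide
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

attn = torch.softmax(scores, dim=-1)
print(attn[0, 0, 0])  # first row attends only to position 0
```

Masked entries get zero weight because exp(-inf) = 0, while each row still normalizes to 1 over the visible positions.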

Use cases: Large Language Models (LLM), Machine Translation, Text Summarization, Code Generation, Vision (ViT), Multimodal AI

GAN (Generative Adversarial Network)

Proposed by Ian Goodfellow in 2014, GANs consist of two competing networks: a Generator that tries to produce realistic data from random noise, and a Discriminator that tries to distinguish real data from generated data. The two are trained as a minimax game until, at equilibrium, the generator produces samples the discriminator cannot tell apart from real data.

Training instability is the main challenge of GANs, with issues like mode collapse and oscillation. Techniques such as WGAN and Spectral Normalization have partially addressed these problems.

GAN Variants

Variant | Key Innovation | Application
DCGAN | Convolutional layers replace FC layers | Image generation
StyleGAN (1/2/3) | Style mapping network, layer-wise control | High-quality face generation
CycleGAN | Unpaired image-to-image translation (cycle consistency) | Style transfer, season conversion
Pix2Pix | Paired conditional image generation | Image translation (sketch to photo)
WGAN | Wasserstein distance replaces JS divergence | More stable training

PyTorch -- Simple GAN

import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, z_dim=100, img_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, img_dim), nn.Tanh(),
        )
    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    def __init__(self, img_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),
        )
    def forward(self, x):
        return self.net(x)
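
One adversarial update for a generator/discriminator pair like the one above looks as follows; this is a sketch in which a random tensor stands in for a batch of real flattened 28x28 images scaled to [-1, 1], and the learning rate is an illustrative choice:

```python
import torch
import torch.nn as nn

# Compact equivalents of the Generator/Discriminator above
G = nn.Sequential(nn.Linear(100, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.rand(64, 784) * 2 - 1  # stand-in for a real batch
z = torch.randn(64, 100)
fake = G(z)

# 1) Discriminator step: push real -> 1, fake -> 0 (detach so G is untouched)
d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# 2) Generator step: fool D into outputting 1 on fakes
g_loss = bce(D(fake), torch.ones(64, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

Alternating these two steps is the minimax game; the `detach()` is what keeps the discriminator's loss from updating the generator.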

Use cases: Image Generation, Style Transfer, Data Augmentation, Super Resolution, Image Inpainting

Diffusion Models

Diffusion models are a class of probabilistic generative models based on two processes: the forward process gradually adds Gaussian noise to data until it becomes pure noise; the reverse process learns to progressively denoise to recover the original data. By parameterizing a denoising network (typically a U-Net), the model learns to predict and remove noise at each step.

Compared to GANs, diffusion models offer more stable training, higher generation quality, and better diversity, but slower inference (requiring multiple denoising steps). Sampling strategies like DDPM and DDIM can accelerate inference.
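
The forward process has a closed form: x_t = √(ᾱ_t)·x_0 + √(1-ᾱ_t)·ε with ε ~ N(0, I), where ᾱ_t is the cumulative product of (1-β_t). This is what lets DDPM sample any noise level in one shot during training. A sketch under the standard linear β schedule:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule (DDPM)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal retention

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) directly, without iterating t steps."""
    eps = torch.randn_like(x0)
    a = alpha_bar[t]
    return a.sqrt() * x0 + (1 - a).sqrt() * eps

x0 = torch.randn(4, 3, 32, 32)  # stand-in for a batch of real images
x_mid = q_sample(x0, t=500)     # partially noised
x_end = q_sample(x0, t=T - 1)   # nearly pure Gaussian noise
print(alpha_bar[-1].item())     # close to 0: original signal destroyed
```

The denoising network is trained to predict `eps` from `x_t` and `t`; the reverse process then subtracts that predicted noise step by step.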

Representative Models

Model | Architecture | Features
Stable Diffusion | Latent Diffusion (LDM) + U-Net + CLIP | Open source, text-to-image, fine-tunable
DALL-E 2/3 | CLIP + diffusion prior + decoder | High-quality text-to-image
Midjourney | Proprietary diffusion architecture | Strong artistic style
Sora | Diffusion Transformer (DiT) | Text-to-video generation

Use cases: Text-to-Image, Text-to-Video, Image Editing, Super Resolution, 3D Generation, Audio Synthesis

Architecture Comparison

Architecture | Best For | Key Innovation | Params Range | Year
FNN | Tabular classification/regression | Fully connected + backpropagation | 1K - 10M | 1986
CNN | Image/vision tasks | Convolution + weight sharing + pooling | 60K - 138M | 1998
RNN/LSTM | Short/medium sequences | Recurrent connections + gating | 100K - 50M | 1997
Transformer | NLP, multimodal, long sequences | Self-attention + positional encoding | 110M - 1.8T | 2017
GAN | Image generation/style transfer | Generator-discriminator adversarial training | 1M - 200M | 2014
Diffusion | High-quality image/video generation | Iterative denoising + probabilistic modeling | 100M - 8B | 2020

Architecture Selection Guide

Choose the most suitable neural network architecture based on your task:

Task | Recommended Architecture | Recommended Models
Image Classification | CNN / ViT | ResNet, EfficientNet, ViT, ConvNeXt
Object Detection | CNN | YOLOv8, Faster R-CNN, DETR
Semantic Segmentation | CNN / Transformer | U-Net, SegFormer, Mask2Former
Text Classification / NLU | Transformer (Encoder) | BERT, RoBERTa, DeBERTa
Text Generation / Chat | Transformer (Decoder) | GPT-4, Claude, LLaMA 3, Gemini
Machine Translation | Transformer (Enc-Dec) | T5, mBART, NLLB
Time Series Forecasting | LSTM / Transformer | LSTM, Temporal Fusion Transformer, PatchTST
Image Generation | Diffusion / GAN | Stable Diffusion, DALL-E 3, StyleGAN
Video Generation | Diffusion Transformer | Sora, Runway Gen-3, Kling
Multimodal Understanding | Vision Transformer | CLIP, LLaVA, GPT-4V, Gemini
Speech Recognition | Transformer | Whisper, Wav2Vec 2.0, Conformer
Recommendation System | FNN / Transformer | DeepFM, DLRM, SASRec

Frequently Asked Questions

Is CNN or Transformer better for image tasks?

Both have advantages. CNNs perform better on smaller datasets due to their inductive biases (locality, translation invariance) providing good priors; they are also more efficient for training and inference. Vision Transformers (ViT) typically outperform CNNs on large-scale datasets (ImageNet-21K, JFT-300M) because self-attention captures global dependencies. The current trend is hybrid architectures (e.g., ConvNeXt borrows Transformer design principles while using convolutions) and fine-tuning pretrained ViTs on smaller datasets.

Why did Transformers replace RNNs as the dominant NLP architecture?

Three main reasons: (1) Parallelization -- RNNs must process sequentially, while Transformer self-attention allows all positions to be computed simultaneously, providing orders-of-magnitude speedup in training; (2) Long-range dependencies -- self-attention directly models relationships between any two positions regardless of distance (theoretically), whereas LSTM still forgets on extremely long sequences despite gating; (3) Scalability -- the Transformer architecture continues improving performance as parameters scale to hundreds of billions (scaling law), something RNNs cannot match.

Which is better: GAN or Diffusion Models?

Diffusion models have comprehensively surpassed GANs in generation quality and diversity, especially for text-conditioned generation. However, GANs still hold an advantage in inference speed (a single forward pass vs. 20-50 denoising steps for diffusion). For real-time applications or resource-constrained scenarios, GANs may still be preferable. That said, distillation techniques (Consistency Models, LCM) are drastically reducing diffusion model inference steps, approaching real-time. Today, if speed is not a constraint, diffusion models are the preferred choice for image generation.

Which architecture should beginners learn first?

Recommended learning order: (1) Start with FNN (fully connected networks) to understand forward propagation, backpropagation, and gradient descent fundamentals; (2) Then learn CNN to understand convolution, pooling, and feature extraction -- practice with MNIST/CIFAR-10; (3) Learn basic RNN/LSTM concepts (even though Transformers dominate, understanding recurrent structure helps grasp the essence of sequence modeling); (4) Finally, dive deep into Transformers, the most important architecture today. PyTorch is recommended as the learning framework due to its intuitive code and easier debugging.

How do I choose the right model size (parameter count)?

Model size depends on three factors: (1) Data volume -- according to Chinchilla scaling laws, optimal training tokens should be approximately 20x the parameter count; insufficient data can lead to overfitting with larger models; (2) Compute resources -- a 7B parameter model requires ~14GB VRAM for inference (FP16), 70B requires ~140GB needing multi-GPU parallelism; (3) Task complexity -- simple classification tasks work fine with BERT-base (110M), while complex reasoning may need 70B+ models. The practical recommendation is to start small, evaluate on a validation set, and scale up only if performance is insufficient.