Neural Network Architectures

What Are Neural Networks?

Neural networks are a class of machine learning models inspired by biological neurons. They consist of interconnected computational nodes (neurons) that learn patterns in data by adjusting the weights between connections. Since Rosenblatt's Perceptron in 1958, neural networks have gone through multiple waves: backpropagation in the 1980s, AlexNet sparking the deep learning revolution in 2012, the Transformer architecture revolutionizing NLP in 2017, and diffusion models breaking through in generative AI in the 2020s.

Different tasks demand different architectures: CNNs for image recognition, RNNs/LSTMs for sequential data, Transformers for language and multimodal tasks, and GANs or diffusion models for generation. This guide systematically covers the principles, core components, code implementations, and best use cases for each architecture.

Feedforward Neural Network (FNN)

The Feedforward Neural Network is the most fundamental neural network architecture. Data flows in one direction from input layer through one or more hidden layers to the output layer, with no cyclic connections. Each layer's neurons are fully connected to the next layer, with activation functions (e.g., ReLU, Sigmoid) introducing non-linearity.

FNNs suit simple classification and regression tasks on structured data, such as tabular prediction and credit scoring. When the data has spatial structure (images) or temporal structure (sequences), a CNN or RNN is the better choice.

Key Pipeline

Input → Hidden Layer(s) → Activation (ReLU) → Output

PyTorch

import torch
import torch.nn as nn

class FNN(nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )
    def forward(self, x):
        return self.net(x)

model = FNN(784, 256, 10)  # e.g. MNIST digits
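
The model alone does nothing without an optimization loop. Below is a minimal training step for the FNN above; it is a sketch in which random tensors stand in for a real data loader, and the optimizer/loss choices (Adam, cross-entropy) are common defaults rather than requirements:

```python
import torch
import torch.nn as nn

# Same FNN as above, repeated so this snippet runs standalone
class FNN(nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )
    def forward(self, x):
        return self.net(x)

model = FNN(784, 256, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

x = torch.randn(32, 784)          # stand-in batch of 32 flattened images
y = torch.randint(0, 10, (32,))   # stand-in integer class labels

optimizer.zero_grad()
loss = criterion(model(x), y)     # forward pass + loss
loss.backward()                   # backpropagation
optimizer.step()                  # gradient descent weight update
```

In practice `x` and `y` come from a `DataLoader` and this step repeats for many epochs.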

Use cases: Tabular Classification, Regression, Credit Scoring, Recommender Embeddings

Convolutional Neural Network (CNN)

CNNs use convolutional filters (kernels) that slide across input data to automatically extract local features like edges, textures, and shapes. Convolutional layers use weight sharing to drastically reduce parameter count; pooling layers reduce spatial dimensions and enhance translation invariance. By stacking multiple convolutional layers, the network progressively abstracts from low-level features (edges) to high-level semantic features (object parts, full objects).

CNNs are the foundational architecture for computer vision, widely used in image classification, object detection, and semantic segmentation. While Vision Transformers (ViT) have surpassed CNNs on some tasks in recent years, CNNs remain important due to their efficiency and mature tooling.

Core Component Pipeline

Conv2d → BatchNorm → ReLU → MaxPool → Conv2d → ... → Flatten → FC (Linear)

Famous CNN Models

Model | Year | Key Innovation | Layers | Params
LeNet-5 | 1998 | First practical CNN, handwriting recognition | 5 | 60K
AlexNet | 2012 | ReLU + Dropout + GPU training, sparked DL revolution | 8 | 60M
VGG-16 | 2014 | Uniform 3x3 small kernels, deeper networks | 16 | 138M
ResNet-50 | 2015 | Residual connections (skip connections), solved vanishing gradient | 50 | 25M
EfficientNet | 2019 | Compound scaling (depth/width/resolution) | ~82 | 5-66M

PyTorch

import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 256),  # assumes 32x32 input: two 2x2 pools give 8x8
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )
    def forward(self, x):
        return self.classifier(self.features(x))
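
The `64 * 8 * 8` in the classifier assumes 32x32 inputs (e.g. CIFAR-10): a 3x3 convolution with `padding=1` preserves spatial size, and each `MaxPool2d(2)` halves it (32 → 16 → 8). A quick shape check of that arithmetic:

```python
import torch
import torch.nn as nn

# The feature extractor from SimpleCNN above, rebuilt standalone
features = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
)
x = torch.randn(1, 3, 32, 32)   # one 32x32 RGB image
print(features(x).shape)        # torch.Size([1, 64, 8, 8])
```

For other input resolutions, the first `Linear` layer's input size must change accordingly (or be replaced by `nn.AdaptiveAvgPool2d`).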

Use cases: Image Classification, Object Detection (YOLO), Semantic Segmentation, Face Recognition, Medical Imaging

RNN / LSTM / GRU

Recurrent Neural Networks (RNN) are designed for sequential data. The hidden state h(t) at each time step depends on both the current input x(t) and the previous hidden state h(t-1), capturing temporal dependencies in sequences. However, standard RNNs suffer from severe vanishing/exploding gradient problems on long sequences.

LSTM (Long Short-Term Memory) introduces three gating mechanisms (forget gate, input gate, output gate) and an independent cell state, effectively solving the long-range dependency problem. GRU (Gated Recurrent Unit) is a simplified variant that merges the forget and input gates into a single "update gate," requiring fewer parameters and training faster, while achieving comparable performance on many tasks.

RNN vs LSTM vs GRU Comparison

Feature | RNN | LSTM | GRU
Gating | None | 3 gates (forget/input/output) | 2 gates (reset/update)
Long-range Deps | Poor (vanishing gradient) | Excellent | Good
Parameters | Fewest | Most (4x hidden) | Medium (3x hidden)
Training Speed | Fast | Slow | Medium
Best For | Short sequences | Long text, speech | Medium sequences, limited resources
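
The 4x / 3x parameter ratios above can be verified directly: an LSTM holds four gated weight sets (input/forget/cell/output) and a GRU three (reset/update/candidate) against a plain RNN's one. A sketch with arbitrary sizes:

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

E, H = 64, 128  # embedding and hidden sizes (arbitrary)
rnn  = nn.RNN(E, H, batch_first=True)
lstm = nn.LSTM(E, H, batch_first=True)
gru  = nn.GRU(E, H, batch_first=True)

base = n_params(rnn)          # one weight/bias set
print(n_params(lstm) / base)  # 4.0 -- four gated sets
print(n_params(gru) / base)   # 3.0 -- three gated sets
```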

PyTorch

import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden, num_classes):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=2,
                            batch_first=True, dropout=0.3)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):
        x = self.emb(x)                    # (B, T, E)
        _, (h_n, _) = self.lstm(x)          # h_n: (2, B, H)
        out = self.fc(h_n[-1])              # last layer hidden
        return out

Use cases: Time Series Forecasting, Text Generation (pre-Transformer), Speech Recognition, Machine Translation (Seq2Seq), Sentiment Analysis

Transformer

The Transformer was proposed by Google in the 2017 paper "Attention Is All You Need." It completely eliminates recurrence, relying solely on the self-attention mechanism to model dependencies between any positions in a sequence. This enables highly parallel training, solving the sequential bottleneck of RNNs.

At its core is the Multi-Head Attention mechanism: input is projected into Query, Key, and Value matrices, and scaled dot-product attention computes how much each position should attend to every other position. Positional Encoding injects sequence order information, since self-attention itself is permutation-invariant.

Core Formula

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

Architecture Components

Input Embedding → Positional Encoding → Multi-Head Attention → Add & Norm → Feed-Forward → Add & Norm → Output
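
Because self-attention is permutation-invariant, the original paper injects order with fixed sinusoidal encodings added to the input embeddings. A minimal sketch of that scheme (learned positional embeddings are a common alternative):

```python
import math
import torch

def sinusoidal_pe(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same)."""
    pos = torch.arange(seq_len).unsqueeze(1).float()    # (T, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))   # (d/2,) frequencies
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe  # broadcast-added to (B, T, d) embeddings

pe = sinusoidal_pe(seq_len=128, d_model=512)
print(pe.shape)  # torch.Size([128, 512])
```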

Transformer Variants

Variant | Structure | Models | Primary Task | Params
Encoder-only | Encoder only, bidirectional attention | BERT, RoBERTa, DeBERTa | Understanding (classification, NER, QA) | 110M-340M
Decoder-only | Decoder only, causal (unidirectional) attention | GPT-4, Claude, LLaMA, Gemini | Generation (chat, code, reasoning) | 7B-1.8T
Encoder-Decoder | Full enc-dec, cross attention | T5, BART, mBART | Translation, summarization, Seq2Seq | 220M-11B
Vision Transformer | Image patches + Transformer encoder | ViT, DeiT, Swin Transformer | Image classification, detection | 86M-632M

PyTorch -- Self-Attention

import torch
import torch.nn as nn
import math

class SelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, C = x.shape
        q = self.W_q(x).view(B, T, self.n_heads, self.d_k).transpose(1,2)
        k = self.W_k(x).view(B, T, self.n_heads, self.d_k).transpose(1,2)
        v = self.W_v(x).view(B, T, self.n_heads, self.d_k).transpose(1,2)

        # Scaled dot-product attention
        attn = (q @ k.transpose(-2,-1)) / math.sqrt(self.d_k)
        attn = torch.softmax(attn, dim=-1)
        out = (attn @ v).transpose(1,2).contiguous().view(B, T, C)
        return self.W_o(out)

# Usage: attn = SelfAttention(d_model=512, n_heads=8)
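
The attention block above is bidirectional. Decoder-only models (the GPT family in the variants table) additionally apply a causal mask so position t cannot attend to positions after t. A minimal sketch of that masking step on attention logits of the same (B, heads, T, T) shape:

```python
import torch

T = 5
scores = torch.randn(1, 1, T, T)  # (B, heads, T, T) raw attention logits

# Upper-triangular True entries mark "future" positions to hide
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

attn = torch.softmax(scores, dim=-1)
print(attn[0, 0, 0])  # first row attends only to position 0
```

Masked entries get zero weight because exp(-inf) = 0, while each row still normalizes to 1 over the visible positions.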

Use cases: Large Language Models (LLM), Machine Translation, Text Summarization, Code Generation, Vision (ViT), Multimodal AI

GAN (Generative Adversarial Network)

Proposed by Ian Goodfellow in 2014, GANs consist of two competing networks: a Generator that tries to produce realistic data from random noise, and a Discriminator that tries to distinguish real data from generated data. The two are trained as a minimax game until, at equilibrium, the generator produces samples the discriminator cannot tell apart from real data.

Training instability is the main challenge of GANs, with issues like mode collapse and oscillation. Techniques such as WGAN and Spectral Normalization have partially addressed these problems.

GAN Variants

Variant | Key Innovation | Application
DCGAN | Convolutional layers replace FC layers | Image generation
StyleGAN (1/2/3) | Style mapping network, layer-wise control | High-quality face generation
CycleGAN | Unpaired image-to-image translation (cycle consistency) | Style transfer, season conversion
Pix2Pix | Paired conditional image generation | Image translation (sketch to photo)
WGAN | Wasserstein distance replaces JS divergence | More stable training

PyTorch -- Simple GAN

import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, z_dim=100, img_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, img_dim), nn.Tanh(),
        )
    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    def __init__(self, img_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),
        )
    def forward(self, x):
        return self.net(x)
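
One adversarial update for a generator/discriminator pair like the one above looks as follows; this is a sketch in which a random tensor stands in for a batch of real flattened 28x28 images scaled to [-1, 1], and the learning rate is an illustrative choice:

```python
import torch
import torch.nn as nn

# Compact equivalents of the Generator/Discriminator above
G = nn.Sequential(nn.Linear(100, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.rand(64, 784) * 2 - 1  # stand-in for a real batch
z = torch.randn(64, 100)
fake = G(z)

# 1) Discriminator step: push real -> 1, fake -> 0 (detach so G is untouched)
d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# 2) Generator step: fool D into outputting 1 on fakes
g_loss = bce(D(fake), torch.ones(64, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

Alternating these two steps is the minimax game; the `detach()` is what keeps the discriminator's loss from updating the generator.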

Use cases: Image Generation, Style Transfer, Data Augmentation, Super Resolution, Image Inpainting

Diffusion Models

Diffusion models are a class of probabilistic generative models based on two processes: the forward process gradually adds Gaussian noise to data until it becomes pure noise; the reverse process learns to progressively denoise to recover the original data. By parameterizing a denoising network (typically a U-Net), the model learns to predict and remove noise at each step.

Compared to GANs, diffusion models offer more stable training, higher generation quality, and better diversity, but slower inference (requiring multiple denoising steps). Sampling strategies like DDPM and DDIM can accelerate inference.
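
The forward process has a closed form: x_t = √(ᾱ_t)·x_0 + √(1-ᾱ_t)·ε with ε ~ N(0, I), where ᾱ_t is the cumulative product of (1-β_t). This is what lets DDPM sample any noise level in one shot during training. A sketch under the standard linear β schedule:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule (DDPM)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal retention

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) directly, without iterating t steps."""
    eps = torch.randn_like(x0)
    a = alpha_bar[t]
    return a.sqrt() * x0 + (1 - a).sqrt() * eps

x0 = torch.randn(4, 3, 32, 32)  # stand-in for a batch of real images
x_mid = q_sample(x0, t=500)     # partially noised
x_end = q_sample(x0, t=T - 1)   # nearly pure Gaussian noise
print(alpha_bar[-1].item())     # close to 0: original signal destroyed
```

The denoising network is trained to predict `eps` from `x_t` and `t`; the reverse process then subtracts that predicted noise step by step.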

Representative Models

Model | Architecture | Features
Stable Diffusion | Latent Diffusion (LDM) + U-Net + CLIP | Open source, text-to-image, fine-tunable
DALL-E 2/3 | CLIP + diffusion prior + decoder | High-quality text-to-image
Midjourney | Proprietary diffusion architecture | Strong artistic style
Sora | Diffusion Transformer (DiT) | Text-to-video generation

Use cases: Text-to-Image, Text-to-Video, Image Editing, Super Resolution, 3D Generation, Audio Synthesis

Architecture Comparison

Architecture | Best For | Key Innovation | Params Range | Year
FNN | Tabular classification/regression | Fully connected + backpropagation | 1K - 10M | 1986
CNN | Image/vision tasks | Convolution + weight sharing + pooling | 60K - 138M | 1998
RNN/LSTM | Short/medium sequences | Recurrent connections + gating | 100K - 50M | 1997
Transformer | NLP, multimodal, long sequences | Self-attention + positional encoding | 110M - 1.8T | 2017
GAN | Image generation/style transfer | Generator-discriminator adversarial training | 1M - 200M | 2014
Diffusion | High-quality image/video generation | Iterative denoising + probabilistic modeling | 100M - 8B | 2020

Architecture Selection Guide

Choose the most suitable neural network architecture based on your task:

Task | Recommended Architecture | Recommended Models
Image Classification | CNN / ViT | ResNet, EfficientNet, ViT, ConvNeXt
Object Detection | CNN | YOLOv8, Faster R-CNN, DETR
Semantic Segmentation | CNN / Transformer | U-Net, SegFormer, Mask2Former
Text Classification / NLU | Transformer (Encoder) | BERT, RoBERTa, DeBERTa
Text Generation / Chat | Transformer (Decoder) | GPT-4, Claude, LLaMA 3, Gemini
Machine Translation | Transformer (Enc-Dec) | T5, mBART, NLLB
Time Series Forecasting | LSTM / Transformer | LSTM, Temporal Fusion Transformer, PatchTST
Image Generation | Diffusion / GAN | Stable Diffusion, DALL-E 3, StyleGAN
Video Generation | Diffusion Transformer | Sora, Runway Gen-3, Kling
Multimodal Understanding | Vision Transformer | CLIP, LLaVA, GPT-4V, Gemini
Speech Recognition | Transformer | Whisper, Wav2Vec 2.0, Conformer
Recommendation System | FNN / Transformer | DeepFM, DLRM, SASRec

Frequently Asked Questions

Is CNN or Transformer better for image tasks?

Both have advantages. CNNs perform better on smaller datasets due to their inductive biases (locality, translation invariance) providing good priors; they are also more efficient for training and inference. Vision Transformers (ViT) typically outperform CNNs on large-scale datasets (ImageNet-21K, JFT-300M) because self-attention captures global dependencies. The current trend is hybrid architectures (e.g., ConvNeXt borrows Transformer design principles while using convolutions) and fine-tuning pretrained ViTs on smaller datasets.

Why did Transformers replace RNNs as the dominant NLP architecture?

Three main reasons: (1) Parallelization -- RNNs must process sequentially, while Transformer self-attention allows all positions to be computed simultaneously, providing orders-of-magnitude speedup in training; (2) Long-range dependencies -- self-attention directly models relationships between any two positions regardless of distance (theoretically), whereas LSTM still forgets on extremely long sequences despite gating; (3) Scalability -- the Transformer architecture continues improving performance as parameters scale to hundreds of billions (scaling law), something RNNs cannot match.

Which is better: GAN or Diffusion Models?

Diffusion models have comprehensively surpassed GANs in generation quality and diversity, especially for text-conditioned generation. However, GANs still hold an advantage in inference speed (a single forward pass vs. 20-50 denoising steps for diffusion). For real-time applications or resource-constrained scenarios, GANs may still be preferable. That said, distillation techniques (Consistency Models, LCM) are drastically reducing diffusion model inference steps, approaching real-time. Today, if speed is not a constraint, diffusion models are the preferred choice for image generation.

Which architecture should beginners learn first?

Recommended learning order: (1) Start with FNN (fully connected networks) to understand forward propagation, backpropagation, and gradient descent fundamentals; (2) Then learn CNN to understand convolution, pooling, and feature extraction -- practice with MNIST/CIFAR-10; (3) Learn basic RNN/LSTM concepts (even though Transformers dominate, understanding recurrent structure helps grasp the essence of sequence modeling); (4) Finally, dive deep into Transformers, the most important architecture today. PyTorch is recommended as the learning framework due to its intuitive code and easier debugging.

How do I choose the right model size (parameter count)?

Model size depends on three factors: (1) Data volume -- according to Chinchilla scaling laws, optimal training tokens should be approximately 20x the parameter count; insufficient data can lead to overfitting with larger models; (2) Compute resources -- a 7B parameter model requires ~14GB VRAM for inference (FP16), 70B requires ~140GB needing multi-GPU parallelism; (3) Task complexity -- simple classification tasks work fine with BERT-base (110M), while complex reasoning may need 70B+ models. The practical recommendation is to start small, evaluate on a validation set, and scale up only if performance is insufficient.