🔀 Comparing Activation Functions in Neural Networks
Submodule: Multilayer Perceptron (MLP) · Deep Learning Fundamentals
We train 4 identical MLPs with Sigmoid, Tanh, ReLU, and Leaky ReLU on the Wine dataset to compare convergence, accuracy, and stability.
Activation functions are a key component of neural networks: they introduce the nonlinearity the network needs to learn complex patterns.
In this use case we:
- Explore the Wine dataset (3 wine classes, 13 chemical properties)
- Implement an MLP from scratch with NumPy (13 → 32 → 16 → 3)
- Train 4 identical models, each with a different activation function
- Compare convergence, accuracy, and stability
| Function | Range | Advantage | Drawback |
|---|---|---|---|
| Sigmoid | $(0, 1)$ | Probabilistic output | Vanishing gradient |
| Tanh | $(-1, 1)$ | Zero-centered | Vanishing gradient |
| ReLU | $[0, \infty)$ | Fast, no vanishing gradient | Dying ReLU |
| Leaky ReLU | $(-\infty, \infty)$ | No dying ReLU | Extra hyperparameter $\alpha$ |
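A quick numerical sanity check of the ranges in this table (a self-contained sketch; the example input values are arbitrary, and the full implementations come in section 2):

import numpy as np

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(1 / (1 + np.exp(-x)))          # Sigmoid: squashed into (0, 1), nearly flat (saturated) at both ends
print(np.tanh(x))                    # Tanh: (-1, 1), zero-centered
print(np.maximum(0, x))              # ReLU: negatives become exactly 0 (the source of "dying ReLU")
print(np.where(x > 0, x, 0.01 * x))  # Leaky ReLU: keeps a small slope (alpha = 0.01) for negatives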
1. Exploring the Wine Dataset
The Wine dataset contains 178 samples of 3 wine varieties from Italy, each described by 13 chemical properties (alcohol, malic acid, ash, alkalinity, magnesium, phenols, flavanoids, etc.).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
# ── Load dataset ──────────────────────────────────────────
data = load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
print(f"Dataset: {df.shape[0]} samples, {df.shape[1]-1} features, {len(data.target_names)} classes")
print(f"Classes: {list(data.target_names)}")
print("\nSamples per class:")
print(df['target'].value_counts().sort_index().to_string())
# ── Correlation matrix ────────────────────────────────────
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, fmt='.1f', cmap='RdBu_r', center=0,
            square=True, linewidths=0.5, annot_kws={'size': 7})
plt.title('Correlation between variables', fontsize=14)
plt.tight_layout()
plt.show()
Dataset description (the `data.DESCR` text):
Wine recognition dataset
------------------------
**Data Set Characteristics:**
:Number of Instances: 178
:Number of Attributes: 13 numeric, predictive attributes and the class
:Attribute Information:
- Alcohol
- Malic acid
- Ash
- Alcalinity of ash
- Magnesium
- Total phenols
- Flavanoids
- Nonflavanoid phenols
- Proanthocyanins
- Color intensity
- Hue
- OD280/OD315 of diluted wines
- Proline
- class:
- class_0
- class_1
- class_2
:Summary Statistics:
============================= ==== ===== ======= =====
Min Max Mean SD
============================= ==== ===== ======= =====
Alcohol: 11.0 14.8 13.0 0.8
Malic Acid: 0.74 5.80 2.34 1.12
Ash: 1.36 3.23 2.36 0.27
Alcalinity of Ash: 10.6 30.0 19.5 3.3
Magnesium: 70.0 162.0 99.7 14.3
Total Phenols: 0.98 3.88 2.29 0.63
Flavanoids: 0.34 5.08 2.03 1.00
Nonflavanoid Phenols: 0.13 0.66 0.36 0.12
Proanthocyanins: 0.41 3.58 1.59 0.57
Colour Intensity: 1.3 13.0 5.1 2.3
Hue: 0.48 1.71 0.96 0.23
OD280/OD315 of diluted wines: 1.27 4.00 2.61 0.71
Proline: 278 1680 746 315
============================= ==== ===== ======= =====
:Missing Attribute Values: None
:Class Distribution: class_0 (59), class_1 (71), class_2 (48)
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
This is a copy of UCI ML Wine recognition datasets.
https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data
The data is the results of a chemical analysis of wines grown in the same
region in Italy by three different cultivators. There are thirteen different
measurements taken for different constituents found in the three types of
wine.
Original Owners:
Forina, M. et al, PARVUS -
An Extendible Package for Data Exploration, Classification and Correlation.
Institute of Pharmaceutical and Food Analysis and Technologies,
Via Brigata Salerno, 16147 Genoa, Italy.
Citation:
Lichman, M. (2013). UCI Machine Learning Repository
[https://archive.ics.uci.edu/ml]. Irvine, CA: University of California,
School of Information and Computer Science.
References:
(1) S. Aeberhard, D. Coomans and O. de Vel,
Comparison of Classifiers in High Dimensional Settings,
Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of
Mathematics and Statistics, James Cook University of North Queensland.
(Also submitted to Technometrics).
The data was used with many others for comparing various
classifiers. The classes are separable, though only RDA
has achieved 100% correct classification.
(RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data))
(All results using the leave-one-out technique)
(2) S. Aeberhard, D. Coomans and O. de Vel,
"THE CLASSIFICATION PERFORMANCE OF RDA"
Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of
Mathematics and Statistics, James Cook University of North Queensland.
(Also submitted to Journal of Chemometrics).
First rows of the dataset:
| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
| 1 | 13.20 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050.0 | 0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
| 3 | 14.37 | 1.95 | 2.50 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480.0 | 0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |
1.1 Distribution of the variables by class
Some variables, such as flavanoids, proline, and color_intensity, show good separation between classes, which will make classification easier.
# ── Distributions by class ────────────────────────────────
fig, axes = plt.subplots(4, 4, figsize=(16, 12))
for i, col in enumerate(data.feature_names):
    ax = axes.flat[i]
    for cls in range(3):
        mask = df['target'] == cls
        ax.hist(df.loc[mask, col], bins=15, alpha=0.5,
                label=data.target_names[cls])
    ax.set_title(col, fontsize=9)
    ax.legend(fontsize=6)
    ax.grid(alpha=0.3)
# Hide the unused subplots (13 features in a 4×4 grid)
for j in range(len(data.feature_names), 16):
    axes.flat[j].set_visible(False)
plt.suptitle('Distribution of each variable by class', fontsize=14)
plt.tight_layout()
plt.show()
2. Activation Functions and Their Derivatives
We implement the 4 activation functions together with their derivatives (needed for backpropagation):
- Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$, derivative: $\sigma'(x) = \sigma(x)(1 - \sigma(x))$
- Tanh: $\tanh(x)$, derivative: $1 - \tanh^2(x)$
- ReLU: $\max(0, x)$, derivative: $\begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}$
- Leaky ReLU: $\begin{cases} x & x > 0 \\ \alpha x & x \leq 0 \end{cases}$, with $\alpha = 0.01$
# ── Activation functions and derivatives ──────────────────
def sigmoid(x):
    # Clip the input to avoid overflow in exp for very negative values
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

def tanh_act(x):
    return np.tanh(x)

def tanh_derivative(x):
    return 1 - np.tanh(x)**2

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)
# ── Visualize the functions ───────────────────────────────
x = np.linspace(-5, 5, 400)
fig, axes = plt.subplots(2, 4, figsize=(16, 6))
funcs = [
    ('Sigmoid', sigmoid, sigmoid_derivative),
    ('Tanh', tanh_act, tanh_derivative),
    ('ReLU', relu, relu_derivative),
    ('Leaky ReLU', leaky_relu, leaky_relu_derivative),
]
# Loop variable is named 'fprime' to avoid shadowing the DataFrame 'df'
for i, (name, f, fprime) in enumerate(funcs):
    # Function
    axes[0, i].plot(x, f(x), lw=2, color='#6C5CE7')
    axes[0, i].set_title(f'{name}', fontsize=12)
    axes[0, i].axhline(0, color='gray', lw=0.5)
    axes[0, i].axvline(0, color='gray', lw=0.5)
    axes[0, i].grid(alpha=0.3)
    # Derivative
    axes[1, i].plot(x, fprime(x), lw=2, color='#00B894')
    axes[1, i].set_title(f"{name}' (derivative)", fontsize=11)
    axes[1, i].axhline(0, color='gray', lw=0.5)
    axes[1, i].axvline(0, color='gray', lw=0.5)
    axes[1, i].grid(alpha=0.3)
plt.suptitle('Activation functions and their derivatives', fontsize=14)
plt.tight_layout()
plt.show()
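As a sanity check on these derivatives, a central finite-difference comparison is worth running (a minimal sketch; `h` is a small step size, and x = 0 is avoided because ReLU and Leaky ReLU have a kink there):

# Central finite differences: (f(x+h) - f(x-h)) / 2h should match the analytic derivative
h = 1e-5
xs = np.array([-2.0, -0.5, 0.5, 2.0])  # avoid x = 0, where ReLU is not differentiable
for name, f, fprime in funcs:
    numeric = (f(xs + h) - f(xs - h)) / (2 * h)
    analytic = fprime(xs)
    print(f"{name:<12} max abs error: {np.max(np.abs(numeric - analytic)):.2e}")

The errors should be on the order of 1e-9 or smaller, confirming that each derivative matches its function.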
3. A Neural Network from Scratch with NumPy
We implement an MLP with the following architecture:
Input (13) → Hidden 1 (32) → Hidden 2 (16) → Softmax (3)
The NeuralNetwork class lets you choose the activation function used in the hidden layers. It uses He initialization, cross-entropy loss, and mini-batch SGD.
class NeuralNetwork:
    """
    MLP with 2 hidden layers and a configurable activation.
    Softmax output for multiclass classification.
    """
    def __init__(self, input_size, hidden_sizes, output_size,
                 activation='sigmoid', learning_rate=0.01, leaky_alpha=0.01):
        self.lr = learning_rate
        self.leaky_alpha = leaky_alpha
        # He initialization
        self.W1 = np.random.randn(input_size, hidden_sizes[0]) * np.sqrt(2. / input_size)
        self.b1 = np.zeros((1, hidden_sizes[0]))
        self.W2 = np.random.randn(hidden_sizes[0], hidden_sizes[1]) * np.sqrt(2. / hidden_sizes[0])
        self.b2 = np.zeros((1, hidden_sizes[1]))
        self.W3 = np.random.randn(hidden_sizes[1], output_size) * np.sqrt(2. / hidden_sizes[1])
        self.b3 = np.zeros((1, output_size))
        # Select the activation function
        act_map = {
            'sigmoid': (sigmoid, sigmoid_derivative),
            'tanh': (tanh_act, tanh_derivative),
            'relu': (relu, relu_derivative),
            'leaky_relu': (
                lambda x: leaky_relu(x, self.leaky_alpha),
                lambda x: leaky_relu_derivative(x, self.leaky_alpha)
            ),
        }
        if activation not in act_map:
            raise ValueError(f"Unknown activation '{activation}'")
        self.activation, self.activation_derivative = act_map[activation]

    def softmax(self, x):
        # Subtract the row-wise max for numerical stability
        exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
        return exp_x / np.sum(exp_x, axis=1, keepdims=True)

    def forward(self, X):
        self.Z1 = X @ self.W1 + self.b1
        self.A1 = self.activation(self.Z1)
        self.Z2 = self.A1 @ self.W2 + self.b2
        self.A2 = self.activation(self.Z2)
        self.Z3 = self.A2 @ self.W3 + self.b3
        self.A3 = self.softmax(self.Z3)
        return self.A3

    def compute_loss(self, y_true, y_pred):
        # Cross-entropy over the probability assigned to the true class
        m = y_true.shape[0]
        log_lik = -np.log(y_pred[range(m), y_true] + 1e-9)
        return np.sum(log_lik) / m

    def backward(self, X, y_true):
        m = X.shape[0]
        # One-hot encode the labels
        y_oh = np.zeros_like(self.A3)
        y_oh[np.arange(m), y_true] = 1
        # Softmax + cross-entropy gradient simplifies to (probs - one_hot)
        dZ3 = self.A3 - y_oh
        dW3 = self.A2.T @ dZ3 / m
        db3 = np.sum(dZ3, axis=0, keepdims=True) / m
        dA2 = dZ3 @ self.W3.T
        dZ2 = dA2 * self.activation_derivative(self.Z2)
        dW2 = self.A1.T @ dZ2 / m
        db2 = np.sum(dZ2, axis=0, keepdims=True) / m
        dA1 = dZ2 @ self.W2.T
        dZ1 = dA1 * self.activation_derivative(self.Z1)
        dW1 = X.T @ dZ1 / m
        db1 = np.sum(dZ1, axis=0, keepdims=True) / m
        # Update weights (vanilla gradient descent)
        self.W3 -= self.lr * dW3; self.b3 -= self.lr * db3
        self.W2 -= self.lr * dW2; self.b2 -= self.lr * db2
        self.W1 -= self.lr * dW1; self.b1 -= self.lr * db1

    def train(self, X, y, epochs=20, batch_size=32, verbose=True):
        history = {'loss': [], 'accuracy': []}
        for epoch in range(epochs):
            # Shuffle before each epoch for mini-batch SGD
            idx = np.random.permutation(X.shape[0])
            X_s, y_s = X[idx], y[idx]
            epoch_loss = 0
            for i in range(0, X.shape[0], batch_size):
                X_b, y_b = X_s[i:i+batch_size], y_s[i:i+batch_size]
                y_pred = self.forward(X_b)
                loss = self.compute_loss(y_b, y_pred)
                epoch_loss += loss * X_b.shape[0]
                self.backward(X_b, y_b)
            epoch_loss /= X.shape[0]
            acc = np.mean(self.predict(X) == y)
            history['loss'].append(epoch_loss)
            history['accuracy'].append(acc)
            if verbose and (epoch + 1) % 5 == 0:
                print(f"  Epoch {epoch+1:>3}/{epochs} — Loss: {epoch_loss:.4f} — Acc: {acc*100:.1f}%")
        return history

    def predict(self, X):
        return np.argmax(self.forward(X), axis=1)

print('✅ NeuralNetwork class defined (13 → 32 → 16 → 3)')
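Before training on real data, a quick smoke test helps (a minimal sketch; `X_demo` and `y_demo` are made-up random inputs). It confirms the output shapes and that the initial loss sits roughly near $\ln(3) \approx 1.10$, as expected when an untrained network spreads probability over 3 classes:

# Smoke test on random data: check shapes and initial loss
np.random.seed(0)
X_demo = np.random.randn(8, 13)           # 8 fake samples with 13 features
y_demo = np.random.randint(0, 3, size=8)  # random labels among 3 classes
net = NeuralNetwork(13, [32, 16], 3, activation='relu')
probs = net.forward(X_demo)
print(probs.shape)                        # (8, 3)
print(probs.sum(axis=1))                  # every row sums to 1 (softmax)
print(net.compute_loss(y_demo, probs))    # roughly ln(3) ≈ 1.10 before training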
4. Training with the 4 Activation Functions
We train 4 identical models (same seed, same architecture), varying only the hidden-layer activation function. Note that accuracy is measured on the same data used for training; a held-out evaluation sketch appears at the end of section 5.
# ── Data preparation ──────────────────────────────────────
X = data.data
y = data.target
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# ── Training ──────────────────────────────────────────────
activations = ['sigmoid', 'tanh', 'relu', 'leaky_relu']
histories = {}
models = {}
for act in activations:
    print(f"\n{'═'*50}")
    print(f"  Activation: {act.upper()}")
    print(f"{'═'*50}")
    np.random.seed(42)  # Same initialization for every model
    model = NeuralNetwork(
        input_size=X.shape[1],
        hidden_sizes=[32, 16],
        output_size=len(np.unique(y)),
        activation=act,
        learning_rate=0.05
    )
    history = model.train(X_scaled, y, epochs=30, batch_size=16)
    histories[act] = history
    models[act] = model
Output (abridged):
Training model with activation: sigmoid
Epoch 1/30, Loss: 1.0810 ... Epoch 10/30, Loss: 0.7719 ... Epoch 20/30, Loss: 0.4712 ... Epoch 30/30, Loss: 0.2894
Training model with activation: tanh
Epoch 1/30, Loss: 1.2838 ... Epoch 10/30, Loss: 0.1155 ... Epoch 20/30, Loss: 0.0541 ... Epoch 30/30, Loss: 0.0319
Training model with activation: relu
Epoch 1/30, Loss: 0.9881 ... Epoch 10/30, Loss: 0.0607 ... Epoch 20/30, Loss: 0.0220 ... Epoch 30/30, Loss: 0.0121
Training model with activation: leaky_relu
Epoch 1/30, Loss: 0.6728 ... Epoch 10/30, Loss: 0.0450 ... Epoch 20/30, Loss: 0.0159 ... Epoch 30/30, Loss: 0.0091
5. Comparing the Results
5.1 Loss curves (cross-entropy)
# ── Loss curves ───────────────────────────────────────────
colors = {'sigmoid': '#E17055', 'tanh': '#6C5CE7', 'relu': '#00B894', 'leaky_relu': '#FDCB6E'}
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Loss
for act in activations:
    axes[0].plot(histories[act]['loss'], label=act, lw=2, color=colors[act])
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Cross-Entropy Loss', fontsize=12)
axes[0].set_title('Loss over training', fontsize=14)
axes[0].legend(fontsize=11)
axes[0].grid(alpha=0.3)
# Accuracy
for act in activations:
    axes[1].plot(histories[act]['accuracy'], label=act, lw=2, color=colors[act])
axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('Accuracy', fontsize=12)
axes[1].set_title('Accuracy over training', fontsize=14)
axes[1].legend(fontsize=11)
axes[1].grid(alpha=0.3)
axes[1].set_ylim(0, 1.05)
plt.tight_layout()
plt.show()
5.2 Final accuracy and summary
# ── Final summary ─────────────────────────────────────────
print(f"{'Activation':<15} {'Final loss':>12} {'Accuracy':>10}")
print('─' * 40)
for act in activations:
    loss = histories[act]['loss'][-1]
    acc = histories[act]['accuracy'][-1]
    print(f"{act:<15} {loss:>12.4f} {acc*100:>9.1f}%")
# Bar plot of final accuracy
fig, ax = plt.subplots(figsize=(8, 4))
accs = [histories[a]['accuracy'][-1] * 100 for a in activations]
bars = ax.bar(activations, accs, color=[colors[a] for a in activations],
              edgecolor='white', width=0.5)
for bar, acc in zip(bars, accs):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
            f'{acc:.1f}%', ha='center', fontsize=12)
ax.set_ylabel('Accuracy (%)', fontsize=12)
ax.set_title('Final accuracy by activation function', fontsize=14)
ax.set_ylim(0, 105)
ax.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
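The accuracies above are computed on the data the models were trained on. As a complementary check, here is a minimal sketch that retrains each model on a stratified 70/30 split and scores the held-out portion (the split ratio and seed are arbitrary choices, not part of the original experiment):

from sklearn.model_selection import train_test_split

# Retrain on 70% of the data, evaluate on the held-out 30%
X_tr, X_te, y_tr, y_te = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42, stratify=y)
for act in activations:
    np.random.seed(42)  # same initialization as before
    m = NeuralNetwork(input_size=X_tr.shape[1], hidden_sizes=[32, 16],
                      output_size=3, activation=act, learning_rate=0.05)
    m.train(X_tr, y_tr, epochs=30, batch_size=16, verbose=False)
    test_acc = np.mean(m.predict(X_te) == y_te)
    print(f"{act:<12} test accuracy: {test_acc*100:.1f}%")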
6. Conclusions
| Activation | Convergence | Typical Accuracy | Recommendation |
|---|---|---|---|
| Sigmoid | Slow (vanishing gradient) | Moderate | ⚠️ Avoid in deep hidden layers |
| Tanh | Better than sigmoid | Good | ✅ When you need zero-centered outputs |
| ReLU | Fast | High | ✅ Default choice in most networks |
| Leaky ReLU | Fast | High | ✅ When ReLU produces dead neurons |
Key observations:
- Sigmoid converges the most slowly because its gradients saturate (they become very small far from the origin); this is the well-known vanishing gradient problem (see the quick check after this list)
- ReLU and Leaky ReLU tend to reach higher accuracy because their gradients do not vanish for positive inputs
- Tanh does better than sigmoid because it is zero-centered, but it still saturates at the extremes
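A quick numerical illustration of the first point (a minimal sketch reusing the derivatives from section 2):

# Sigmoid's derivative peaks at 0.25 (at x = 0) and decays fast away from it,
# so each sigmoid layer can shrink the backpropagated gradient by ~4x or more
print(sigmoid_derivative(np.array([0.0, 2.0, 5.0])))  # ≈ [0.25, 0.105, 0.0066]
print(0.25 ** 10)  # best case after 10 sigmoid layers: ≈ 9.5e-07
# ReLU's derivative is exactly 1 for positive inputs, so gradients pass through intact
print(relu_derivative(np.array([0.5, 2.0, 5.0])))     # [1. 1. 1.]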
💡 In practice, ReLU is the default activation function for hidden layers. If you run into dead neurons (dying ReLU), try Leaky ReLU or ELU.