Classification Model Comparison: Classical ML vs Neural Networks
A comparison of logistic regression (SGD), Random Forest, Gradient Boosting, and MLPs on multiclass classification over the Covertype dataset.
In this notebook we build a complete, didactic workflow for solving a supervised multiclass classification problem on real tabular data.
The goal is to compare, under equal conditions, models from different families:
- A linear model trained by gradient descent (SGDClassifier with logistic loss).
- Random Forest Classifier (a bagging ensemble of trees).
- HistGradientBoostingClassifier (histogram-based gradient boosting of trees).
- A simple neural network (a small MLP in PyTorch).
- A deep neural network (a deeper, regularized MLP in PyTorch).
We work with the Covertype dataset (fetch_covtype), a forest-cover-type classification problem with 7 classes and enough data to experiment with both classical models and neural networks.
Learning objective
We want to answer, with experimental evidence, some very common Machine Learning questions:
- When does a linear model fall short?
- What do tree ensembles add over a linear baseline?
- Do neural networks improve performance on multiclass tabular data?
- How do we detect overfitting and compare models rigorously?
To that end we follow a structure consistent with the underlying theory:
- A brief EDA and data preparation.
- Training of each model.
- Learning curves on train/validation (loss and accuracy).
- Comparable metrics on the test set.
- Visual diagnostics of the best model.
Mathematical and computational foundations (a practical view)
1) Multiclass classification
Each sample $x$ belongs to a class $y \in \{1,\dots,K\}$. The model outputs probabilities:
$$ \hat{p}(y=k\mid x) $$
The final prediction is usually:
$$ \hat{y}=\arg\max_k \hat{p}(y=k\mid x) $$
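As a tiny numeric illustration (with made-up logits, not values from the dataset), softmax converts raw scores into probabilities and argmax picks the predicted class:

```python
import numpy as np

logits = np.array([1.0, 3.0, 0.5])         # made-up scores for K = 3 classes
p = np.exp(logits) / np.exp(logits).sum()  # softmax: probabilities summing to 1
y_hat = int(np.argmax(p))                  # predicted class index

print(p.round(3), y_hat)  # y_hat is 1, the class with the largest logit
```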
2) Loss function: cross-entropy
We mainly use cross-entropy (log-loss), the standard loss for probabilistic classification:
$$ \mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik}\log(\hat{p}_{ik}) $$
where $y_{ik}=1$ if the true class of sample $i$ is $k$, and 0 otherwise.
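The formula can be checked against scikit-learn's log_loss on a toy example (the probabilities below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy example: 3 samples, K = 3 classes (labels 0, 1, 2)
y_true = np.array([0, 2, 1])
# One row of predicted class probabilities per sample
p = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])
# Manual cross-entropy: average of -log(probability assigned to the true class)
manual = -np.mean(np.log(p[np.arange(len(y_true)), y_true]))
assert np.isclose(manual, log_loss(y_true, p, labels=[0, 1, 2]))
```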
3) Evaluation metrics
To avoid relying on a single metric, we compare:
- Accuracy: overall fraction of correct predictions.
- Balanced Accuracy: average per-class recall (useful under class imbalance).
- Macro F1: balance between precision and recall, averaged over classes.
- Log-loss: probabilistic quality (lower is better).
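A toy imbalanced example (made-up labels, not Covertype data) shows why the first two metrics can disagree:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Imbalanced toy problem: 90 samples of class 0, 10 of class 1
y_true = np.array([0] * 90 + [1] * 10)
# A lazy classifier that always predicts the majority class
y_pred = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_pred)           # 0.90: looks great
bal = balanced_accuracy_score(y_true, y_pred)  # 0.50: per-class recall is (1.0 + 0.0) / 2
print(acc, bal)
```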
4) Learning curves
After each training run we inspect train/val curves of loss and accuracy.
- If train keeps improving while validation worsens → likely overfitting.
- If both remain poor → likely underfitting.
- If both converge nicely → stable training.
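These rules of thumb can be sketched as a small helper; diagnose_curves and its tolerance tol are illustrative inventions, not part of the notebook's pipeline:

```python
import numpy as np

def diagnose_curves(train_loss, val_loss, tol=0.02):
    """Rough diagnosis from loss histories (a heuristic, not a formal test)."""
    # Overfitting hint: validation loss rose noticeably after its minimum
    best = int(np.argmin(val_loss))
    if val_loss[-1] > val_loss[best] + tol:
        return 'possible overfitting'
    # Underfitting hint: train loss barely moved from its starting value
    if train_loss[0] - train_loss[-1] < tol:
        return 'possible underfitting'
    return 'stable'

print(diagnose_curves([1.0, 0.6, 0.3, 0.2], [1.0, 0.7, 0.6, 0.9]))  # possible overfitting
```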
Dataset and libraries
- Dataset: fetch_covtype (Forest CoverType, UCI).
- Task: multiclass classification (7 classes).
- Size: large; we take a stratified sample to keep running times reasonable in a teaching setting.
- Models: SGDClassifier, RandomForestClassifier, HistGradientBoostingClassifier, and two MLPs in PyTorch.
# General configuration and libraries
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from sklearn.metrics import (
accuracy_score, balanced_accuracy_score, f1_score,
precision_score, recall_score, log_loss,
confusion_matrix, ConfusionMatrixDisplay, classification_report
)
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
sns.set_theme(style='whitegrid', context='notebook')
plt.rcParams['figure.figsize'] = (9, 5)
SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)
1) Data loading and initial EDA
We inspect the dataset, check class balance, and get a feel for feature scales. Since the dataset is large, we work with a stratified sample so the notebook stays reproducible on standard machines.
# Load the Covertype dataset
cov = fetch_covtype(as_frame=True)
df = cov.frame.copy()
# Shift class labels to the range [0..K-1] for PyTorch (CrossEntropyLoss)
df['Cover_Type'] = df['Cover_Type'] - 1
print('Total shape:', df.shape)
print('Number of classes:', df['Cover_Type'].nunique())
df.head()
Total shape: (581012, 55)
Number of classes: 7
| | Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | Horizontal_Distance_To_Fire_Points | ... | Soil_Type_31 | Soil_Type_32 | Soil_Type_33 | Soil_Type_34 | Soil_Type_35 | Soil_Type_36 | Soil_Type_37 | Soil_Type_38 | Soil_Type_39 | Cover_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2596.0 | 51.0 | 3.0 | 258.0 | 0.0 | 510.0 | 221.0 | 232.0 | 148.0 | 6279.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4 |
| 1 | 2590.0 | 56.0 | 2.0 | 212.0 | -6.0 | 390.0 | 220.0 | 235.0 | 151.0 | 6225.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4 |
| 2 | 2804.0 | 139.0 | 9.0 | 268.0 | 65.0 | 3180.0 | 234.0 | 238.0 | 135.0 | 6121.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 3 | 2785.0 | 155.0 | 18.0 | 242.0 | 118.0 | 3090.0 | 238.0 | 238.0 | 122.0 | 6211.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 4 | 2595.0 | 45.0 | 2.0 | 153.0 | -1.0 | 391.0 | 220.0 | 234.0 | 150.0 | 6172.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4 |
5 rows × 55 columns
# General info and null check
print(df.info())
print('\nNulls per column:')
print(df.isnull().sum().sum())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 581012 entries, 0 to 581011
Data columns (total 55 columns): 10 numeric terrain features (Elevation, Aspect, Slope, distances, hillshade indices), 4 Wilderness_Area flags and 40 Soil_Type flags (all float64, 581012 non-null), plus the int32 target Cover_Type.
dtypes: float64(54), int32(1)
memory usage: 241.6 MB
None

Nulls per column:
0
# Class distribution
class_counts = df['Cover_Type'].value_counts().sort_index()
class_props = (class_counts / class_counts.sum()).round(4)
display(pd.DataFrame({'count': class_counts, 'proportion': class_props}))
plt.figure(figsize=(8, 4))
sns.barplot(x=class_counts.index, y=class_counts.values, palette='viridis')
plt.title('Class distribution (Cover_Type)')
plt.xlabel('Class')
plt.ylabel('Number of samples')
plt.tight_layout()
plt.show()
| Cover_Type | count | proportion |
|---|---|---|
| 0 | 211840 | 0.3646 |
| 1 | 283301 | 0.4876 |
| 2 | 35754 | 0.0615 |
| 3 | 2747 | 0.0047 |
| 4 | 9493 | 0.0163 |
| 5 | 17367 | 0.0299 |
| 6 | 20510 | 0.0353 |
# For quick visual EDA, take a moderate sample
eda_sample = df.sample(5000, random_state=SEED)
num_cols = ['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology',
'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways',
'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',
'Horizontal_Distance_To_Fire_Points']
_ = eda_sample[num_cols + ['Cover_Type']].hist(bins=30, figsize=(14, 10), edgecolor='black')
plt.suptitle('Distributions (EDA sample)', y=1.02)
plt.tight_layout()
plt.show()
# Correlation matrix over the main numeric features
corr = eda_sample[num_cols + ['Cover_Type']].corr(numeric_only=True)
plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap='coolwarm', center=0)
plt.title('Correlation (EDA sample)')
plt.tight_layout()
plt.show()
2) Stratified sampling and train/val/test split
The full dataset is large. To balance teaching value against runtime, we use a stratified sample of 120k instances.
We then create a 70/15/15 split (the validation fraction 0.1765 ≈ 0.15/0.85, i.e. 15% of the total taken from the remaining 85%) and apply scaling for the scale-sensitive models (linear model and neural networks).
# Stratified sample for classroom-scale training
N_SAMPLE = 120_000
sample_df = (
df.groupby('Cover_Type', group_keys=False)
.apply(lambda x: x.sample(int(np.floor(len(x) / len(df) * N_SAMPLE)), random_state=SEED))
.sample(frac=1, random_state=SEED)
)
# Rounding adjustment to reach exactly N_SAMPLE
if len(sample_df) < N_SAMPLE:
faltan = N_SAMPLE - len(sample_df)
extra = df.drop(sample_df.index).sample(faltan, random_state=SEED)
sample_df = pd.concat([sample_df, extra], axis=0)
sample_df = sample_df.sample(frac=1, random_state=SEED).reset_index(drop=True)
X = sample_df.drop(columns=['Cover_Type']).values
y = sample_df['Cover_Type'].values.astype(int)
X_train_full, X_test, y_train_full, y_test = train_test_split(
X, y, test_size=0.15, random_state=SEED, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
X_train_full, y_train_full, test_size=0.1765, random_state=SEED, stratify=y_train_full
)
print('Train:', X_train.shape, 'Val:', X_val.shape, 'Test:', X_test.shape)
# Scaling for the linear model and the networks
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_val_sc = scaler.transform(X_val)
X_test_sc = scaler.transform(X_test)
Train: (83997, 54) Val: (18003, 54) Test: (18000, 54)
3) Helper functions
def classification_metrics(y_true, y_pred, y_proba, labels):
"""Métricas principales para clasificación multiclase."""
return {
'Accuracy': accuracy_score(y_true, y_pred),
'BalancedAcc': balanced_accuracy_score(y_true, y_pred),
'F1_macro': f1_score(y_true, y_pred, average='macro'),
'Precision_macro': precision_score(y_true, y_pred, average='macro', zero_division=0),
'Recall_macro': recall_score(y_true, y_pred, average='macro', zero_division=0),
'LogLoss': log_loss(y_true, y_proba, labels=labels)
}
def plot_curves(train_values, val_values, title, ylabel):
epochs = np.arange(1, len(train_values) + 1)
plt.figure(figsize=(8, 4.5))
plt.plot(epochs, train_values, label='Train')
plt.plot(epochs, val_values, label='Validation')
plt.title(title)
plt.xlabel('Epoch / iteration')
plt.ylabel(ylabel)
plt.legend()
plt.tight_layout()
plt.show()
4) Model 1 — SGDClassifier (linear logistic)
We train with partial_fit so we can monitor per-epoch curves.
labels = np.unique(y_train)
sgd_clf = SGDClassifier(
loss='log_loss',
penalty='l2',
alpha=1e-4,
learning_rate='optimal',
random_state=SEED
)
n_epochs = 35
sgd_train_loss, sgd_val_loss = [], []
sgd_train_acc, sgd_val_acc = [], []
for epoch in range(n_epochs):
sgd_clf.partial_fit(X_train_sc, y_train, classes=labels)
# Probabilities and predictions on train/val
train_proba = sgd_clf.predict_proba(X_train_sc)
val_proba = sgd_clf.predict_proba(X_val_sc)
train_pred = np.argmax(train_proba, axis=1)
val_pred = np.argmax(val_proba, axis=1)
sgd_train_loss.append(log_loss(y_train, train_proba, labels=labels))
sgd_val_loss.append(log_loss(y_val, val_proba, labels=labels))
sgd_train_acc.append(accuracy_score(y_train, train_pred))
sgd_val_acc.append(accuracy_score(y_val, val_pred))
plot_curves(sgd_train_loss, sgd_val_loss, 'SGDClassifier - Loss (log-loss)', 'Log-loss')
plot_curves(sgd_train_acc, sgd_val_acc, 'SGDClassifier - Accuracy', 'Accuracy')
sgd_test_proba = sgd_clf.predict_proba(X_test_sc)
sgd_test_pred = np.argmax(sgd_test_proba, axis=1)
sgd_metrics = classification_metrics(y_test, sgd_test_pred, sgd_test_proba, labels)
print('Test metrics, SGDClassifier:', sgd_metrics)
Test metrics, SGDClassifier: {'Accuracy': 0.7115, 'BalancedAcc': 0.43608227327048715, 'F1_macro': 0.43135921736101407, 'Precision_macro': 0.5367739522080731, 'Recall_macro': 0.43608227327048715, 'LogLoss': 0.6925762035020172}
5) Model 2 — Random Forest Classifier
Although it has no "epochs" in the neural-network sense, we can watch it evolve as trees are added (warm_start=True).
rf = RandomForestClassifier(
n_estimators=10,
max_depth=None,
min_samples_leaf=1,
random_state=SEED,
n_jobs=-1,
warm_start=True
)
rf_train_loss, rf_val_loss = [], []
rf_train_acc, rf_val_acc = [], []
n_trees_list = list(range(20, 221, 20))
for n_trees in n_trees_list:
rf.set_params(n_estimators=n_trees)
rf.fit(X_train, y_train)
train_proba = rf.predict_proba(X_train)
val_proba = rf.predict_proba(X_val)
train_pred = np.argmax(train_proba, axis=1)
val_pred = np.argmax(val_proba, axis=1)
rf_train_loss.append(log_loss(y_train, train_proba, labels=labels))
rf_val_loss.append(log_loss(y_val, val_proba, labels=labels))
rf_train_acc.append(accuracy_score(y_train, train_pred))
rf_val_acc.append(accuracy_score(y_val, val_pred))
plot_curves(rf_train_loss, rf_val_loss, 'Random Forest - log-loss evolution', 'Log-loss')
plot_curves(rf_train_acc, rf_val_acc, 'Random Forest - accuracy evolution', 'Accuracy')
rf_test_proba = rf.predict_proba(X_test)
rf_test_pred = np.argmax(rf_test_proba, axis=1)
rf_metrics = classification_metrics(y_test, rf_test_pred, rf_test_proba, labels)
print('Test metrics, Random Forest:', rf_metrics)
Test metrics, Random Forest: {'Accuracy': 0.9081666666666667, 'BalancedAcc': 0.8008013074440038, 'F1_macro': 0.8422151318812416, 'Precision_macro': 0.9036463130607373, 'Recall_macro': 0.8008013074440038, 'LogLoss': 0.2921555196106839}
6) Model 3 — HistGradientBoostingClassifier
We train with warm_start=True and increase max_iter step by step to obtain train/val curves per iteration.
hgb = HistGradientBoostingClassifier(
loss='log_loss',
learning_rate=0.08,
max_depth=8,
max_iter=1,
random_state=SEED,
warm_start=True
)
hgb_train_loss, hgb_val_loss = [], []
hgb_train_acc, hgb_val_acc = [], []
iters = list(range(10, 131, 10))
for it in iters:
hgb.set_params(max_iter=it)
hgb.fit(X_train, y_train)
train_proba = hgb.predict_proba(X_train)
val_proba = hgb.predict_proba(X_val)
train_pred = np.argmax(train_proba, axis=1)
val_pred = np.argmax(val_proba, axis=1)
hgb_train_loss.append(log_loss(y_train, train_proba, labels=labels))
hgb_val_loss.append(log_loss(y_val, val_proba, labels=labels))
hgb_train_acc.append(accuracy_score(y_train, train_pred))
hgb_val_acc.append(accuracy_score(y_val, val_pred))
plot_curves(hgb_train_loss, hgb_val_loss, 'HistGradientBoosting - log-loss evolution', 'Log-loss')
plot_curves(hgb_train_acc, hgb_val_acc, 'HistGradientBoosting - accuracy evolution', 'Accuracy')
hgb_test_proba = hgb.predict_proba(X_test)
hgb_test_pred = np.argmax(hgb_test_proba, axis=1)
hgb_metrics = classification_metrics(y_test, hgb_test_pred, hgb_test_proba, labels)
print('Test metrics, HistGradientBoosting:', hgb_metrics)
Test metrics, HistGradientBoosting: {'Accuracy': 0.8527222222222223, 'BalancedAcc': 0.7749441773759186, 'F1_macro': 0.7997939532587265, 'Precision_macro': 0.8365347358362979, 'Recall_macro': 0.7749441773759186, 'LogLoss': 0.37386966249793824}
7) Neural networks (PyTorch)
We train two networks:
- Simple: one hidden layer.
- Deep: more layers, BatchNorm, and Dropout.
In both cases we plot loss and accuracy curves on train/val.
# Conversion to tensors
X_train_t = torch.tensor(X_train_sc, dtype=torch.float32)
y_train_t = torch.tensor(y_train, dtype=torch.long)
X_val_t = torch.tensor(X_val_sc, dtype=torch.float32)
y_val_t = torch.tensor(y_val, dtype=torch.long)
X_test_t = torch.tensor(X_test_sc, dtype=torch.float32)
y_test_t = torch.tensor(y_test, dtype=torch.long)
train_loader = DataLoader(TensorDataset(X_train_t, y_train_t), batch_size=256, shuffle=True)
val_loader = DataLoader(TensorDataset(X_val_t, y_val_t), batch_size=512, shuffle=False)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
num_classes = len(labels)
print('Device:', device, '| Classes:', num_classes)
Device: cuda | Classes: 7
class SimpleClassifier(nn.Module):
def __init__(self, in_features, n_classes):
super().__init__()
self.net = nn.Sequential(
nn.Linear(in_features, 64),
nn.ReLU(),
nn.Linear(64, n_classes)
)
def forward(self, x):
return self.net(x)
class DeepClassifier(nn.Module):
def __init__(self, in_features, n_classes):
super().__init__()
self.net = nn.Sequential(
nn.Linear(in_features, 256),
nn.BatchNorm1d(256),
nn.ReLU(),
nn.Dropout(0.25),
nn.Linear(256, 128),
nn.ReLU(),
nn.Dropout(0.20),
nn.Linear(128, 64),
nn.ReLU(),
nn.Linear(64, n_classes)
)
def forward(self, x):
return self.net(x)
def train_torch_classifier(model, train_loader, val_loader, epochs=30, lr=1e-3):
model = model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
hist = {
'train_loss': [], 'val_loss': [],
'train_acc': [], 'val_acc': []
}
for epoch in range(epochs):
# Training
model.train()
train_losses = []
train_correct = 0
train_total = 0
for xb, yb in train_loader:
xb, yb = xb.to(device), yb.to(device)
optimizer.zero_grad()
logits = model(xb)
loss = criterion(logits, yb)
loss.backward()
optimizer.step()
train_losses.append(loss.item())
preds = torch.argmax(logits, dim=1)
train_correct += (preds == yb).sum().item()
train_total += yb.size(0)
# Validation
model.eval()
val_losses = []
val_correct = 0
val_total = 0
with torch.no_grad():
for xb, yb in val_loader:
xb, yb = xb.to(device), yb.to(device)
logits = model(xb)
loss = criterion(logits, yb)
val_losses.append(loss.item())
preds = torch.argmax(logits, dim=1)
val_correct += (preds == yb).sum().item()
val_total += yb.size(0)
hist['train_loss'].append(float(np.mean(train_losses)))
hist['val_loss'].append(float(np.mean(val_losses)))
hist['train_acc'].append(train_correct / train_total)
hist['val_acc'].append(val_correct / val_total)
return model, hist
def predict_torch_classifier(model, X_tensor):
model.eval()
with torch.no_grad():
logits = model(X_tensor.to(device))
proba = torch.softmax(logits, dim=1).cpu().numpy()
pred = np.argmax(proba, axis=1)
return pred, proba
7.1 Simple neural network
simple_nn = SimpleClassifier(in_features=X_train_sc.shape[1], n_classes=num_classes)
simple_nn, simple_hist = train_torch_classifier(simple_nn, train_loader, val_loader, epochs=35, lr=1e-3)
plot_curves(simple_hist['train_loss'], simple_hist['val_loss'], 'Simple NN - Loss (CrossEntropy)', 'Loss')
plot_curves(simple_hist['train_acc'], simple_hist['val_acc'], 'Simple NN - Accuracy', 'Accuracy')
simple_test_pred, simple_test_proba = predict_torch_classifier(simple_nn, X_test_t)
simple_metrics = classification_metrics(y_test, simple_test_pred, simple_test_proba, labels)
print('Test metrics, simple NN:', simple_metrics)
Test metrics, simple NN: {'Accuracy': 0.8009444444444445, 'BalancedAcc': 0.6379329409606223, 'F1_macro': 0.6688044063662409, 'Precision_macro': 0.7295193041576231, 'Recall_macro': 0.6379329409606223, 'LogLoss': 0.47379074081632777}
7.2 Deep neural network
deep_nn = DeepClassifier(in_features=X_train_sc.shape[1], n_classes=num_classes)
deep_nn, deep_hist = train_torch_classifier(deep_nn, train_loader, val_loader, epochs=45, lr=8e-4)
plot_curves(deep_hist['train_loss'], deep_hist['val_loss'], 'Deep NN - Loss (CrossEntropy)', 'Loss')
plot_curves(deep_hist['train_acc'], deep_hist['val_acc'], 'Deep NN - Accuracy', 'Accuracy')
deep_test_pred, deep_test_proba = predict_torch_classifier(deep_nn, X_test_t)
deep_metrics = classification_metrics(y_test, deep_test_pred, deep_test_proba, labels)
print('Test metrics, deep NN:', deep_metrics)
Test metrics, deep NN: {'Accuracy': 0.863, 'BalancedAcc': 0.7444783934602474, 'F1_macro': 0.7781052145852312, 'Precision_macro': 0.8287015915424792, 'Recall_macro': 0.7444783934602474, 'LogLoss': 0.33583901012467643}
8) Global model comparison
results = pd.DataFrame([
{'Modelo': 'SGDClassifier (lineal)', **sgd_metrics},
{'Modelo': 'Random Forest', **rf_metrics},
{'Modelo': 'HistGradientBoosting', **hgb_metrics},
{'Modelo': 'NN simple (PyTorch)', **simple_metrics},
{'Modelo': 'NN profunda (PyTorch)', **deep_metrics},
]).sort_values('Accuracy', ascending=False)
results
| | Modelo | Accuracy | BalancedAcc | F1_macro | Precision_macro | Recall_macro | LogLoss |
|---|---|---|---|---|---|---|---|
| 1 | Random Forest | 0.908167 | 0.800801 | 0.842215 | 0.903646 | 0.800801 | 0.292156 |
| 4 | NN profunda (PyTorch) | 0.863000 | 0.744478 | 0.778105 | 0.828702 | 0.744478 | 0.335839 |
| 2 | HistGradientBoosting | 0.852722 | 0.774944 | 0.799794 | 0.836535 | 0.774944 | 0.373870 |
| 3 | NN simple (PyTorch) | 0.800944 | 0.637933 | 0.668804 | 0.729519 | 0.637933 | 0.473791 |
| 0 | SGDClassifier (lineal) | 0.711500 | 0.436082 | 0.431359 | 0.536774 | 0.436082 | 0.692576 |
fig, axes = plt.subplots(1, 3, figsize=(18, 4.5))
sns.barplot(data=results, x='Accuracy', y='Modelo', ax=axes[0], palette='Blues')
axes[0].set_title('Accuracy (mayor es mejor)')
sns.barplot(data=results, x='F1_macro', y='Modelo', ax=axes[1], palette='Greens')
axes[1].set_title('F1 macro (mayor es mejor)')
sns.barplot(data=results, x='LogLoss', y='Modelo', ax=axes[2], palette='Reds_r')
axes[2].set_title('Log-loss (menor es mejor)')
plt.tight_layout()
plt.show()
9) Diagnostics for the best model
We automatically select the best model by Accuracy and show:
- Confusion matrix.
- Per-class classification report.
- Confidence distribution of correct vs incorrect predictions.
best_model_name = results.iloc[0]['Modelo']
print('Best model by Accuracy:', best_model_name)
pred_map = {
'SGDClassifier (lineal)': (sgd_test_pred, sgd_test_proba),
'Random Forest': (rf_test_pred, rf_test_proba),
'HistGradientBoosting': (hgb_test_pred, hgb_test_proba),
'NN simple (PyTorch)': (simple_test_pred, simple_test_proba),
'NN profunda (PyTorch)': (deep_test_pred, deep_test_proba),
}
best_pred, best_proba = pred_map[best_model_name]
# Confusion matrix
disp = ConfusionMatrixDisplay(
confusion_matrix=confusion_matrix(y_test, best_pred, labels=labels),
display_labels=labels
)
fig, ax = plt.subplots(figsize=(7, 7))
disp.plot(ax=ax, cmap='Blues', colorbar=False)
plt.title(f'Confusion matrix - {best_model_name}')
plt.tight_layout()
plt.show()
# Classification report
print(classification_report(y_test, best_pred, digits=4))
Best model by Accuracy: Random Forest
precision recall f1-score support
0 0.9247 0.8885 0.9062 6563
1 0.8994 0.9490 0.9235 8777
2 0.8784 0.9197 0.8986 1108
3 0.8971 0.7176 0.7974 85
4 0.8876 0.5102 0.6479 294
5 0.8830 0.7435 0.8073 538
6 0.9554 0.8772 0.9146 635
accuracy 0.9082 18000
macro avg 0.9036 0.8008 0.8422 18000
weighted avg 0.9086 0.9082 0.9068 18000
# Maximum predicted probability per sample
best_conf = best_proba.max(axis=1)
correct = (best_pred == y_test)
plt.figure(figsize=(9, 4.5))
sns.kdeplot(best_conf[correct], label='Correct predictions', fill=True)
sns.kdeplot(best_conf[~correct], label='Incorrect predictions', fill=True)
plt.title('Confidence distribution of the best model')
plt.xlabel('Maximum predicted probability')
plt.ylabel('Density')
plt.legend()
plt.tight_layout()
plt.show()
10) Quick test block (sanity checks)
Small checks to validate that the pipeline produces coherent results.
assert all(np.isfinite(results['Accuracy'])), 'Non-finite Accuracy detected'
assert all(np.isfinite(results['F1_macro'])), 'Non-finite F1 detected'
assert all(np.isfinite(results['LogLoss'])), 'Non-finite LogLoss detected'
assert (results['Accuracy'] >= 0).all() and (results['Accuracy'] <= 1).all(), 'Accuracy out of range'
print('Best model accuracy:', float(results.iloc[0]['Accuracy']))
print('Best model LogLoss:', float(results.sort_values('LogLoss').iloc[0]['LogLoss']))
print('✅ Sanity checks passed')
Best model accuracy: 0.9081666666666667
Best model LogLoss: 0.2921555196106839
✅ Sanity checks passed
Conclusions and next steps
Conclusions
- The linear baseline (SGDClassifier) is useful as a reference, but tends to fall short when decision boundaries are nonlinear.
- Tree ensembles (Random Forest / HistGradientBoosting) usually perform very well on multiclass tabular data and are robust.
- Neural networks can be highly competitive, but demand more tuning of architecture, regularization, and optimization.
- No single metric is enough: combining accuracy, macro F1, and log-loss captures both raw correctness and probabilistic quality.
What to try next
- Hyperparameter search (Random Search / Bayesian).
- Early stopping and learning-rate schedulers for the networks.
- Class rebalancing (if needed) or class weights.
- Extra models: XGBoost / LightGBM / CatBoost.
- Stacking ensembles combining classical models and networks.
- Stratified cross-validation for more robust estimates.
Key message: in real classification work, the best model is not chosen by intuition but through rigorous empirical comparison plus error analysis.