
An SPPF Improvement Based on Large-Kernel Perception and Non-Dilated Convolution: A YOLOv8 Object Detection Architecture Incorporating UniRepLK

Source: original, 2025/7/21 12:04:21

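In object detection, the YOLO family is widely deployed in both industry and research because it balances speed and accuracy well. YOLOv8, a recent and widely used member of the family, refines the backbone and the feature pyramid to further improve real-time performance and robustness. Its SPPF (Spatial Pyramid Pooling Fast) module, however, still relies on plain max pooling, which limits how well the model recognizes small and occluded objects in complex scenes. To address this, this article proposes an ILK-SPPF (Improved Large Kernel SPPF) module that brings the large-kernel design of UniRepLKNet into YOLOv8, using non-dilated large-kernel convolution and structural re-parameterization to enlarge the receptive field and strengthen context modeling.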

1. Evolution of SPP and SPPF and their use in YOLO

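Spatial Pyramid Pooling (SPP) was introduced to remove the fixed-input-size constraint of CNNs and to make features more robust to spatial variation. It has been used in several detection frameworks, for example YOLOv3 (the YOLOv3-SPP variant), and the RoI pooling layer of Fast R-CNN is a single-level special case of it. As demand for lightweight models grew, the Ultralytics team introduced SPPF (Spatial Pyramid Pooling Fast) starting with YOLOv5 v6.0. Whereas the original SPP applies max pooling with several window sizes in parallel, SPPF applies one small max-pooling window several times in series. This produces the same outputs as the parallel form (three stacked 5×5 pools cover 5×5, 9×9, and 13×13 windows) with much less computation, so inference is faster at no cost in accuracy.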
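
The equivalence between the serial and the parallel form can be checked directly. The following is a minimal pure-Python sketch, not the Ultralytics implementation: `max_pool2d` is a hypothetical helper that mimics a stride-1 max pool with `k // 2` padding (out-of-range positions are ignored, which matches padding with negative infinity), and the 16×16 random feature map is an arbitrary assumption.

```python
import random

NEG_INF = float("-inf")


def max_pool2d(grid, k):
    """Stride-1 max pooling with k // 2 padding; positions outside the grid are ignored."""
    h, w = len(grid), len(grid[0])
    r = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            best = NEG_INF
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        best = max(best, grid[y][x])
            row.append(best)
        out.append(row)
    return out


random.seed(0)
feature = [[random.random() for _ in range(16)] for _ in range(16)]

# SPP: three parallel pools with windows 5, 9 and 13
spp = [max_pool2d(feature, k) for k in (5, 9, 13)]

# SPPF: one 5x5 pool applied three times in series
y1 = max_pool2d(feature, 5)
y2 = max_pool2d(y1, 5)
y3 = max_pool2d(y2, 5)
sppf = [y1, y2, y3]

print(spp == sppf)  # True
```

Per output position, the parallel form inspects 25 + 81 + 169 = 275 values, while the serial form inspects 3 × 25 = 75, which is where the speed-up comes from.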


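In YOLOv8, SPPF remains an important part of the backbone. Its main roles are to:

- improve the global awareness of the feature map
- improve adaptation to objects at multiple scales
- enrich the semantic content passed to the feature pyramid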

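However, because SPPF is built from standard pooling (5×5 max pooling), its receptive field can only grow by stacking a fixed, non-learnable window. It therefore has difficulty capturing long-range dependencies, especially for small or occluded objects.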

2. Introduction to UniRepLKNet

Paper: https://arxiv.org/pdf/2311.15599
Code: https://github.com/AILab-CVC/UniRepLKNet

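UniRepLKNet architecture: the LarK block consists of the Dilated Reparam Block (proposed in the UniRepLKNet paper), an SE block, a feed-forward network (FFN), and batch normalization (BN) layers. The only difference between the SmaK block and the LarK block is that the former uses a depthwise 3×3 convolution in place of the Dilated Reparam Block. Stages are connected by downsampling blocks implemented with stride-2 dense 3×3 convolution layers.
[Figure: UniRepLKNet architecture]
[Figure: UniRepLKNetBlock schematic]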

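2.1 Design principles for large-kernel CNNs

In recent years, large-kernel convolution has become a trend in vision model design because it enlarges the receptive field directly, reduces the number of stacked layers needed, and strengthens spatial modeling. UniRepLKNet in particular reports that a large kernel (for example 13×13), implemented as a non-dilated convolution and trained with structural re-parameterization, can improve accuracy without sacrificing efficiency.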

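2.1.1 Local structure design: the foundation of efficient feature extraction

To get the most out of a limited compute budget, the local block design uses the following components:

- SE (Squeeze-and-Excitation) attention: learns per-channel weights dynamically to emphasize informative channels.
- Bottleneck structure: reduces the channel count with a 1×1 convolution before the 3×3 convolution to cut computation.
- Depthwise separable convolution: further reduces parameters and FLOPs, which makes deployment easier.

With accuracy preserved, these components free up compute budget for the large-kernel module introduced next.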
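
To make the "freed-up budget" concrete, the weight counts of these structures can be compared with simple arithmetic. This is an illustrative sketch only: the helper names, the 256-channel width, the reduction factor of 4, and the trailing 1×1 expansion in the bottleneck are assumptions made for the example, and biases are ignored.

```python
def dense_conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k convolution followed by a 1x1 pointwise convolution."""
    return c_in * k * k + c_in * c_out


def bottleneck_params(c, k, reduction=4):
    """1x1 reduce, k x k convolution on c // reduction channels, 1x1 expand."""
    mid = c // reduction
    return c * mid + mid * mid * k * k + mid * c


c = 256
print(dense_conv_params(c, c, 3))            # 589824
print(depthwise_separable_params(c, c, 3))   # 67840
print(bottleneck_params(c, 3))               # 69632
print(depthwise_separable_params(c, c, 13))  # 108800
```

Under these assumptions, even a 13×13 depthwise separable layer needs less than a fifth of the weights of a dense 3×3 layer, which is why a large kernel is affordable when it is applied depthwise.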

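2.1.2 Re-parameterization: an equivalent conversion from multiple branches to a single convolution

To combine training-time flexibility with inference-time efficiency, this article adopts the Dilated Reparam Block from UniRepLKNet. Its core logic is as follows:

- During training, the block uses parallel depthwise branches, each followed by its own BN layer, and sums their outputs:
  - one branch is the non-dilated large-kernel convolution itself;
  - the remaining branches are small-kernel dilated convolutions (for a 13×13 block, kernel sizes 5, 7, 3, 3, 3 with dilation rates 1, 2, 3, 4, 5).
- At inference time, structural re-parameterization folds all branches into a single non-dilated large-kernel convolution (for example 7×7 or 13×13) that computes the same function.
- Mathematically, a k×k convolution with dilation rate r is exactly equivalent to a non-dilated convolution whose kernel has size r(k-1)+1 and zeros between the original taps, so each small dilated kernel can be zero-padded and added to the large kernel.

This strategy improves the model's representational power and avoids the gridding effect and distorted boundary responses that dilated convolutions can cause.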
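
The merge step can be illustrated in one dimension. The sketch below is a simplified stand-in for the 2D depthwise case and leaves out the BN layers (which are folded into weights and biases before merging); the helper names, the signal, and the kernel values are all made up for the example.

```python
def conv1d(signal, kernel, dilation=1):
    """'Same'-size 1D cross-correlation with zero padding (odd kernel length)."""
    k = len(kernel)
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for t in range(k):
            j = i + (t - k // 2) * dilation
            if 0 <= j < len(signal):
                acc += kernel[t] * signal[j]
        out.append(acc)
    return out


def dilated_to_dense(kernel, dilation):
    """Insert (dilation - 1) zeros between taps; the length becomes dilation * (k - 1) + 1."""
    dense = [0.0] * (dilation * (len(kernel) - 1) + 1)
    for t, w in enumerate(kernel):
        dense[t * dilation] = w
    return dense


def merge_into_large(large, small_dense):
    """Zero-pad the smaller kernel symmetrically and add it to the large kernel."""
    pad = (len(large) - len(small_dense)) // 2
    merged = list(large)
    for t, w in enumerate(small_dense):
        merged[pad + t] += w
    return merged


signal = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0, 4.0, 1.5, 0.0, -1.0]
large = [0.5, -1.0, 0.25, 2.0, 0.75, -0.5, 1.0]  # 7-tap non-dilated kernel
small = [1.0, -2.0, 3.0]                          # 3-tap kernel used with dilation 2

# Training-time form: two parallel branches whose outputs are summed
two_branch = [a + b for a, b in zip(conv1d(signal, large),
                                    conv1d(signal, small, dilation=2))]

# Inference-time form: a single merged 7-tap kernel
merged = merge_into_large(large, dilated_to_dense(small, 2))
single = conv1d(signal, merged)

print(dilated_to_dense(small, 2))  # [1.0, 0.0, -2.0, 0.0, 3.0]
print(all(abs(a - b) < 1e-9 for a, b in zip(two_branch, single)))  # True
```

The same idea carries over to 2D: the dilated kernel is expanded with zeros in both directions, padded to the size of the large kernel, and added element-wise.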

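2.1.3 Kernel size selection and scaling-law analysis

In practice, the kernel size must be chosen according to the downstream task and the overall network architecture:

- For the object detection task in this article, experiments indicate that a 13×13 kernel is large enough to cover the receptive-field needs of most scenes.
- Larger kernels (17×17 and above) enlarge the receptive field further, but they add noticeable compute cost with diminishing returns.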
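
The cost side of this trade-off is easy to estimate, since a depthwise k×k convolution scales with k². The following sketch uses a hypothetical helper and an assumed setting (512 channels on a 20×20 feature map, which is what a 640×640 input gives at stride 32); it counts weights and multiply-accumulates only and says nothing about accuracy.

```python
def depthwise_cost(channels, height, width, k):
    """Return (weights, multiply-accumulates) of a stride-1 depthwise k x k convolution."""
    params = channels * k * k
    macs = params * height * width
    return params, macs


# Assumed setting: 512 channels, 20 x 20 feature map
for k in (5, 7, 13, 17):
    params, macs = depthwise_cost(512, 20, 20, k)
    relative = round(k * k / (13 * 13), 2)  # cost relative to a 13 x 13 kernel
    print(k, params, macs, relative)
```

Going from 13×13 to 17×17 multiplies the depthwise cost by 289 / 169, roughly 1.71, while the receptive field of a single layer grows by only four pixels per side.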

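Design guidelines for model scaling are summarized as follows:

- In small models (for example YOLO-Tiny scale), adding several large-kernel modules is an effective way to raise model capacity quickly.
- Once the model reaches Base scale (for example 36 layers), adding more large-kernel modules no longer improves performance noticeably.
- At that point, prefer efficient depthwise 3×3 convolutions to increase feature abstraction rather than pursuing ever larger kernels.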

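3. Design and architecture of the improved SPPF module

The UniRepLKNetBlock is combined with SPPF by applying it to the concatenated pooling outputs, before the final 1×1 convolution:
[Figure: architecture of SPPF combined with UniRepLKNetBlock]

3.1 Integrating the improved SPPF into YOLOv8

Under ultralytics/nn, create a new package named sppf, then create UniRepLKNet_SPPF.py in it and add the following code: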

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.checkpoint as checkpoint
# timm.models.layers is the older import path; newer timm releases expose these helpers under timm.layers
from timm.models.layers import DropPath, to_2tuple
from ultralytics.nn.modules import Conv


class GRNwithNHWC(nn.Module):
    """GRN (Global Response Normalization) layer.

    Originally proposed in ConvNeXt V2 (https://arxiv.org/abs/2301.00808).
    This implementation is more efficient than the original (https://github.com/facebookresearch/ConvNeXt-V2).
    We assume the inputs to this layer are (N, H, W, C).
    """

    def __init__(self, dim, use_bias=True):
        super().__init__()
        self.use_bias = use_bias
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        if self.use_bias:
            self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))

    def forward(self, x):
        Gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)
        Nx = Gx / (Gx.mean(dim=-1, keepdim=True) + 1e-6)
        if self.use_bias:
            return (self.gamma * Nx + 1) * x + self.beta
        else:
            return (self.gamma * Nx + 1) * x


class NCHWtoNHWC(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return x.permute(0, 2, 3, 1)


class NHWCtoNCHW(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return x.permute(0, 3, 1, 2)


# ================== This function decides which conv implementation (the native or iGEMM) to use
#   Note that iGEMM large-kernel conv impl will be used if
#       -   you attempt to do so (attempt_use_lk_impl=True), and
#       -   it has been installed (follow https://github.com/AILab-CVC/UniRepLKNet), and
#       -   the conv layer is depth-wise, stride = 1, non-dilated, kernel_size > 5, and padding == kernel_size // 2
#   The iGEMM branch is commented out below, so this version always returns the native nn.Conv2d.
def get_conv2d(in_channels, out_channels, kernel_size, stride, padding, dilation, groups, bias,
               attempt_use_lk_impl=True):
    kernel_size = to_2tuple(kernel_size)
    if padding is None:
        padding = (kernel_size[0] // 2, kernel_size[1] // 2)
    else:
        padding = to_2tuple(padding)
    need_large_impl = kernel_size[0] == kernel_size[1] and kernel_size[0] > 5 and padding == (kernel_size[0] // 2, kernel_size[1] // 2)

    # if attempt_use_lk_impl and need_large_impl:
    #     print('---------------- trying to import iGEMM implementation for large-kernel conv')
    #     try:
    #         from depthwise_conv2d_implicit_gemm import DepthWiseConv2dImplicitGEMM
    #         print('---------------- found iGEMM implementation ')
    #     except:
    #         DepthWiseConv2dImplicitGEMM = None
    #         print('---------------- found no iGEMM. use original conv. follow https://github.com/AILab-CVC/UniRepLKNet to install it.')
    #     if DepthWiseConv2dImplicitGEMM is not None and need_large_impl and in_channels == out_channels \
    #             and out_channels == groups and stride == 1 and dilation == 1:
    #         print(f'===== iGEMM Efficient Conv Impl, channels {in_channels}, kernel size {kernel_size} =====')
    #         return DepthWiseConv2dImplicitGEMM(in_channels, kernel_size, bias=bias)
    return nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=kernel_size, stride=stride,
                     padding=padding, dilation=dilation, groups=groups, bias=bias)


def get_bn(dim, use_sync_bn=False):
    if use_sync_bn:
        return nn.SyncBatchNorm(dim)
    else:
        return nn.BatchNorm2d(dim)


class SEBlock(nn.Module):
    """Squeeze-and-Excitation Block proposed in SENet (https://arxiv.org/abs/1709.01507).

    We assume the inputs to this layer are (N, C, H, W).
    """

    def __init__(self, input_channels, internal_neurons):
        super(SEBlock, self).__init__()
        self.down = nn.Conv2d(in_channels=input_channels, out_channels=internal_neurons,
                              kernel_size=1, stride=1, bias=True)
        self.up = nn.Conv2d(in_channels=internal_neurons, out_channels=input_channels,
                            kernel_size=1, stride=1, bias=True)
        self.input_channels = input_channels
        self.nonlinear = nn.ReLU(inplace=True)

    def forward(self, inputs):
        x = F.adaptive_avg_pool2d(inputs, output_size=(1, 1))  # squeeze: global average per channel
        x = self.down(x)
        x = self.nonlinear(x)
        x = self.up(x)
        x = torch.sigmoid(x)  # excitation: per-channel gate in (0, 1)
        return inputs * x.view(-1, self.input_channels, 1, 1)


def fuse_bn(conv, bn):
    """Fold a BN layer into the preceding conv and return the fused (weight, bias)."""
    conv_bias = 0 if conv.bias is None else conv.bias
    std = (bn.running_var + bn.eps).sqrt()
    return conv.weight * (bn.weight / std).reshape(-1, 1, 1, 1), bn.bias + (conv_bias - bn.running_mean) * bn.weight / std


def convert_dilated_to_nondilated(kernel, dilate_rate):
    """Expand a dilated kernel into the equivalent non-dilated kernel by inserting zeros between taps."""
    # Create the 1x1 identity on the kernel's device and dtype so this also works for GPU weights
    identity_kernel = torch.ones((1, 1, 1, 1), dtype=kernel.dtype, device=kernel.device)
    if kernel.size(1) == 1:
        #   This is a DW kernel
        dilated = F.conv_transpose2d(kernel, identity_kernel, stride=dilate_rate)
        return dilated
    else:
        #   This is a dense or group-wise (but not DW) kernel
        slices = []
        for i in range(kernel.size(1)):
            dilated = F.conv_transpose2d(kernel[:, i:i + 1, :, :], identity_kernel, stride=dilate_rate)
            slices.append(dilated)
        return torch.cat(slices, dim=1)


def merge_dilated_into_large_kernel(large_kernel, dilated_kernel, dilated_r):
    """Zero-pad the expanded dilated kernel to the large kernel's size and add the two."""
    large_k = large_kernel.size(2)
    dilated_k = dilated_kernel.size(2)
    equivalent_kernel_size = dilated_r * (dilated_k - 1) + 1
    equivalent_kernel = convert_dilated_to_nondilated(dilated_kernel, dilated_r)
    rows_to_pad = large_k // 2 - equivalent_kernel_size // 2
    merged_kernel = large_kernel + F.pad(equivalent_kernel, [rows_to_pad] * 4)
    return merged_kernel


class DilatedReparamBlock(nn.Module):
    """Dilated Reparam Block proposed in UniRepLKNet (https://github.com/AILab-CVC/UniRepLKNet).

    We assume the inputs to this block are (N, C, H, W).
    """

    def __init__(self, channels, kernel_size, deploy, use_sync_bn=False, attempt_use_lk_impl=True):
        super().__init__()
        self.lk_origin = get_conv2d(channels, channels, kernel_size, stride=1,
                                    padding=kernel_size // 2, dilation=1, groups=channels, bias=deploy,
                                    attempt_use_lk_impl=attempt_use_lk_impl)
        self.attempt_use_lk_impl = attempt_use_lk_impl

        #   Default settings. We did not tune them carefully. Different settings may work better.
        if kernel_size == 17:
            self.kernel_sizes = [5, 9, 3, 3, 3]
            self.dilates = [1, 2, 4, 5, 7]
        elif kernel_size == 15:
            self.kernel_sizes = [5, 7, 3, 3, 3]
            self.dilates = [1, 2, 3, 5, 7]
        elif kernel_size == 13:
            self.kernel_sizes = [5, 7, 3, 3, 3]
            self.dilates = [1, 2, 3, 4, 5]
        elif kernel_size == 11:
            self.kernel_sizes = [5, 5, 3, 3, 3]
            self.dilates = [1, 2, 3, 4, 5]
        elif kernel_size == 9:
            self.kernel_sizes = [5, 5, 3, 3]
            self.dilates = [1, 2, 3, 4]
        elif kernel_size == 7:
            self.kernel_sizes = [5, 3, 3]
            self.dilates = [1, 2, 3]
        elif kernel_size == 5:
            self.kernel_sizes = [3, 3]
            self.dilates = [1, 2]
        else:
            raise ValueError('Dilated Reparam Block requires kernel_size >= 5')

        if not deploy:
            self.origin_bn = get_bn(channels, use_sync_bn)
            for k, r in zip(self.kernel_sizes, self.dilates):
                self.__setattr__('dil_conv_k{}_{}'.format(k, r),
                                 nn.Conv2d(in_channels=channels, out_channels=channels, kernel_size=k, stride=1,
                                           padding=(r * (k - 1) + 1) // 2, dilation=r, groups=channels,
                                           bias=False))
                self.__setattr__('dil_bn_k{}_{}'.format(k, r), get_bn(channels, use_sync_bn=use_sync_bn))

    def forward(self, x):
        if not hasattr(self, 'origin_bn'):  # deploy mode
            return self.lk_origin(x)
        out = self.origin_bn(self.lk_origin(x))
        for k, r in zip(self.kernel_sizes, self.dilates):
            conv = self.__getattr__('dil_conv_k{}_{}'.format(k, r))
            bn = self.__getattr__('dil_bn_k{}_{}'.format(k, r))
            out = out + bn(conv(x))
        return out

    def merge_dilated_branches(self):
        if hasattr(self, 'origin_bn'):
            origin_k, origin_b = fuse_bn(self.lk_origin, self.origin_bn)
            for k, r in zip(self.kernel_sizes, self.dilates):
                conv = self.__getattr__('dil_conv_k{}_{}'.format(k, r))
                bn = self.__getattr__('dil_bn_k{}_{}'.format(k, r))
                branch_k, branch_b = fuse_bn(conv, bn)
                origin_k = merge_dilated_into_large_kernel(origin_k, branch_k, r)
                origin_b += branch_b
            merged_conv = get_conv2d(origin_k.size(0), origin_k.size(0), origin_k.size(2), stride=1,
                                     padding=origin_k.size(2) // 2, dilation=1, groups=origin_k.size(0), bias=True,
                                     attempt_use_lk_impl=self.attempt_use_lk_impl)
            merged_conv.weight.data = origin_k
            merged_conv.bias.data = origin_b
            self.lk_origin = merged_conv
            self.__delattr__('origin_bn')
            for k, r in zip(self.kernel_sizes, self.dilates):
                self.__delattr__('dil_conv_k{}_{}'.format(k, r))
                self.__delattr__('dil_bn_k{}_{}'.format(k, r))


class UniRepLKNetBlock(nn.Module):

    def __init__(self,
                 dim,
                 kernel_size,
                 drop_path=0.,
                 layer_scale_init_value=1e-6,
                 deploy=False,
                 attempt_use_lk_impl=True,
                 with_cp=False,
                 use_sync_bn=False,
                 ffn_factor=4):
        super().__init__()
        self.with_cp = with_cp
        # if deploy:
        #     print('------------------------------- Note: deploy mode')
        # if self.with_cp:
        #     print('****** note with_cp = True, reduce memory consumption but may slow down training ******')

        self.need_contiguous = (not deploy) or kernel_size >= 7

        if kernel_size == 0:
            self.dwconv = nn.Identity()
            self.norm = nn.Identity()
        elif deploy:
            self.dwconv = get_conv2d(dim, dim, kernel_size=kernel_size, stride=1, padding=kernel_size // 2,
                                     dilation=1, groups=dim, bias=True,
                                     attempt_use_lk_impl=attempt_use_lk_impl)
            self.norm = nn.Identity()
        elif kernel_size >= 7:
            self.dwconv = DilatedReparamBlock(dim, kernel_size, deploy=deploy,
                                              use_sync_bn=use_sync_bn,
                                              attempt_use_lk_impl=attempt_use_lk_impl)
            self.norm = get_bn(dim, use_sync_bn=use_sync_bn)
        elif kernel_size == 1:
            self.dwconv = nn.Conv2d(dim, dim, kernel_size=kernel_size, stride=1, padding=kernel_size // 2,
                                    dilation=1, groups=1, bias=deploy)
            self.norm = get_bn(dim, use_sync_bn=use_sync_bn)
        else:
            assert kernel_size in [3, 5]
            self.dwconv = nn.Conv2d(dim, dim, kernel_size=kernel_size, stride=1, padding=kernel_size // 2,
                                    dilation=1, groups=dim, bias=deploy)
            self.norm = get_bn(dim, use_sync_bn=use_sync_bn)

        self.se = SEBlock(dim, dim // 4)

        ffn_dim = int(ffn_factor * dim)
        self.pwconv1 = nn.Sequential(
            NCHWtoNHWC(),
            nn.Linear(dim, ffn_dim))
        self.act = nn.Sequential(
            nn.GELU(),
            GRNwithNHWC(ffn_dim, use_bias=not deploy))
        if deploy:
            self.pwconv2 = nn.Sequential(
                nn.Linear(ffn_dim, dim),
                NHWCtoNCHW())
        else:
            self.pwconv2 = nn.Sequential(
                nn.Linear(ffn_dim, dim, bias=False),
                NHWCtoNCHW(),
                get_bn(dim, use_sync_bn=use_sync_bn))

        self.gamma = nn.Parameter(layer_scale_init_value * torch.ones(dim),
                                  requires_grad=True) if (not deploy) and layer_scale_init_value is not None \
                                                         and layer_scale_init_value > 0 else None
        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()

    def forward(self, inputs):

        def _f(x):
            if self.need_contiguous:
                x = x.contiguous()
            y = self.se(self.norm(self.dwconv(x)))
            y = self.pwconv2(self.act(self.pwconv1(y)))
            if self.gamma is not None:
                y = self.gamma.view(1, -1, 1, 1) * y
            return self.drop_path(y) + x

        if self.with_cp and inputs.requires_grad:
            return checkpoint.checkpoint(_f, inputs)
        else:
            return _f(inputs)

    def reparameterize(self):
        # Merge the dilated branches, then fold the BN layers, layer scale and GRN bias into conv/linear weights
        if hasattr(self.dwconv, 'merge_dilated_branches'):
            self.dwconv.merge_dilated_branches()
        if hasattr(self.norm, 'running_var') and hasattr(self.dwconv, 'lk_origin'):
            std = (self.norm.running_var + self.norm.eps).sqrt()
            self.dwconv.lk_origin.weight.data *= (self.norm.weight / std).view(-1, 1, 1, 1)
            self.dwconv.lk_origin.bias.data = self.norm.bias + (self.dwconv.lk_origin.bias - self.norm.running_mean) * self.norm.weight / std
            self.norm = nn.Identity()
        if self.gamma is not None:
            final_scale = self.gamma.data
            self.gamma = None
        else:
            final_scale = 1
        if self.act[1].use_bias and len(self.pwconv2) == 3:
            grn_bias = self.act[1].beta.data
            self.act[1].__delattr__('beta')
            self.act[1].use_bias = False
            linear = self.pwconv2[0]
            grn_bias_projected_bias = (linear.weight.data @ grn_bias.view(-1, 1)).squeeze()
            bn = self.pwconv2[2]
            std = (bn.running_var + bn.eps).sqrt()
            new_linear = nn.Linear(linear.in_features, linear.out_features, bias=True)
            new_linear.weight.data = linear.weight * (bn.weight / std * final_scale).view(-1, 1)
            linear_bias = 0 if linear.bias is None else linear.bias.data
            linear_bias += grn_bias_projected_bias
            new_linear.bias.data = (bn.bias + (linear_bias - bn.running_mean) * bn.weight / std) * final_scale
            self.pwconv2 = nn.Sequential(new_linear, self.pwconv2[1])


class SPPF_UniRepLK(nn.Module):
    """SPPF (Spatial Pyramid Pooling - Fast) followed by a UniRepLKNetBlock on the concatenated features."""

    def __init__(self, c1, c2, k=5):  # the pooling part is equivalent to SPP(k=(5, 9, 13))
        super().__init__()
        c_ = c1 // 2  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_ * 4, c2, 1, 1)
        self.m = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        # k is also used as the UniRepLKNetBlock kernel size. Only kernel_size >= 7 builds a
        # DilatedReparamBlock, so the default k=5 gives a plain 5x5 depthwise conv.
        self.UniRepLK = UniRepLKNetBlock(c_ * 4, kernel_size=k)

    def forward(self, x):
        """Pool three times in series, concatenate, apply the UniRepLKNetBlock, then project with cv2."""
        x = self.cv1(x)
        y1 = self.m(x)
        y2 = self.m(y1)
        return self.cv2(self.UniRepLK(torch.cat((x, y1, y2, self.m(y2)), 1)))
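
The code above defines `reparameterize()` on `UniRepLKNetBlock` but nothing calls it. A minimal sketch of how the conversion could be triggered before export is shown below. `reparameterize_model` is a hypothetical helper, not part of Ultralytics or of this article's module; it only assumes that the model exposes a PyTorch-style `modules()` iterator, and because the BN folding uses running statistics it is meant to be called on a trained model in eval mode.

```python
def reparameterize_model(model):
    """Call reparameterize() on every submodule that defines it.

    `model` is assumed to expose a PyTorch-style modules() iterator.
    Returns the number of submodules that were converted.
    """
    converted = 0
    for module in model.modules():
        if hasattr(module, "reparameterize"):
            module.reparameterize()
            converted += 1
    return converted


# Intended use (assumption: `net` is a trained torch.nn.Module containing SPPF_UniRepLK):
#   net.eval()
#   reparameterize_model(net)
```

It is worth comparing the outputs of the model before and after the call on the same input to confirm that the conversion preserved the function.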

Register SPPF_UniRepLK in ultralytics/nn/tasks.py by importing it:

from ultralytics.nn.sppf.UniRepLKNet_SPPF import SPPF_UniRepLK


In def parse_model(d, ch, verbose=True) (arguments: model_dict, input_channels(3)), add SPPF_UniRepLK to the set of modules whose channels are scaled:

        if m in {Classify, Conv, ConvTranspose, GhostConv, Bottleneck, GhostBottleneck, SPP, SPPF, DWConv, Focus,
                 BottleneckCSP, C1, C2, C2f, C3, C3TR, C3Ghost, nn.ConvTranspose2d, DWConvTranspose2d, C3x,
                 SPPF_UniRepLK}:
            c1, c2 = ch[f], args[0]
            if c2 != nc:  # if c2 not equal to number of classes (i.e. for Classify() output)
                c2 = make_divisible(c2 * gw, 8)


yolov8_SPPF_UniRepLK.yaml

# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLOv8 object detection model with P3-P5 outputs. For Usage examples see https://docs.ultralytics.com/tasks/detect

# Parameters
nc: 80  # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n.yaml' will call yolov8.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.33, 0.25, 1024]  # YOLOv8n summary: 225 layers,  3157200 parameters,  3157184 gradients,   8.9 GFLOPs
  s: [0.33, 0.50, 1024]  # YOLOv8s summary: 225 layers, 11166560 parameters, 11166544 gradients,  28.8 GFLOPs
  m: [0.67, 0.75, 768]   # YOLOv8m summary: 295 layers, 25902640 parameters, 25902624 gradients,  79.3 GFLOPs
  l: [1.00, 1.00, 512]   # YOLOv8l summary: 365 layers, 43691520 parameters, 43691504 gradients, 165.7 GFLOPs
  x: [1.00, 1.25, 512]   # YOLOv8x summary: 365 layers, 68229648 parameters, 68229632 gradients, 258.5 GFLOPs

# YOLOv8.0n backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, Conv, [64, 3, 2]]  # 0-P1/2
  - [-1, 1, Conv, [128, 3, 2]]  # 1-P2/4
  - [-1, 3, C2f, [128, True]]
  - [-1, 1, Conv, [256, 3, 2]]  # 3-P3/8
  - [-1, 6, C2f, [256, True]]
  - [-1, 1, Conv, [512, 3, 2]]  # 5-P4/16
  - [-1, 6, C2f, [512, True]]
  - [-1, 1, Conv, [1024, 3, 2]]  # 7-P5/32
  - [-1, 3, C2f, [1024, True]]
  - [-1, 1, SPPF_UniRepLK, [1024, 5]]  # 9

# YOLOv8.0n head
head:
  - [-1, 1, nn.Upsample, [None, 2, 'nearest']]
  - [[-1, 6], 1, Concat, [1]]  # cat backbone P4
  - [-1, 3, C2f, [512]]  # 12

  - [-1, 1, nn.Upsample, [None, 2, 'nearest']]
  - [[-1, 4], 1, Concat, [1]]  # cat backbone P3
  - [-1, 3, C2f, [256]]  # 15 (P3/8-small)

  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 12], 1, Concat, [1]]  # cat head P4
  - [-1, 3, C2f, [512]]  # 18 (P4/16-medium)

  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 9], 1, Concat, [1]]  # cat head P5
  - [-1, 3, C2f, [1024]]  # 21 (P5/32-large)

  - [[15, 18, 21], 1, Detect, [nc]]  # Detect(P3, P4, P5)
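
To see what the entry `[-1, 1, SPPF_UniRepLK, [1024, 5]]` turns into, the channel arithmetic can be traced by hand. The sketch below is a stand-alone approximation: `make_divisible` is re-implemented here as "round up to a multiple of the divisor", the scaling follows the `c2 * gw` expression from the `parse_model` snippet, and `dwconv_branch` is a hypothetical helper that mirrors the branch order in `UniRepLKNetBlock.__init__`. The example values assume the `n` scale (width 0.25) and 256 incoming channels.

```python
import math


def make_divisible(x, divisor):
    """Round x up to the nearest multiple of divisor."""
    return math.ceil(x / divisor) * divisor


def sppf_unireplk_channels(prev_channels, yaml_c2, width_multiple):
    """Channel sizes SPPF_UniRepLK ends up with after width scaling."""
    c1 = prev_channels
    c2 = make_divisible(yaml_c2 * width_multiple, 8)
    hidden = c1 // 2        # c_ in SPPF_UniRepLK
    block_dim = hidden * 4  # channels seen by UniRepLKNetBlock after the concat
    return c1, c2, hidden, block_dim


def dwconv_branch(kernel_size, deploy=False):
    """Name the spatial operator UniRepLKNetBlock builds for a given kernel size."""
    if kernel_size == 0:
        return "identity"
    if deploy:
        return "single depthwise conv"
    if kernel_size >= 7:
        return "DilatedReparamBlock"
    if kernel_size == 1:
        return "dense 1x1 conv"
    if kernel_size in (3, 5):
        return "plain depthwise conv"
    raise ValueError("unsupported kernel size")


print(sppf_unireplk_channels(256, 1024, 0.25))  # (256, 256, 128, 512)
print(dwconv_branch(5))   # plain depthwise conv
print(dwconv_branch(13))  # DilatedReparamBlock
```

With the argument 5 from the YAML, the block therefore runs a plain 5×5 depthwise convolution. Because `k` also sets the max-pooling window, enabling the 13×13 Dilated Reparam path would need either a larger `k` (which changes the pooling as well) or a separate kernel-size argument, which is a design choice left open here.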

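4. Conclusion and future work

To address the limited receptive field of the SPPF module in YOLOv8, this article proposed an improvement that combines UniRepLK large-kernel perception with non-dilated convolution. By introducing multi-branch large-kernel convolution, structural re-parameterization, and channel attention, the design aims to strengthen recognition in complex scenes, particularly for small objects. Future work will examine how well the module generalizes to other detection frameworks (such as YOLO-NAS and RTMDet) and will explore combining it with Transformer structures for stronger global modeling.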
