[STM32U3 Review] Lightweight Face Detection Deployment

stmCortex  Posted: 2026-6-27 12:40

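Summary:	Evaluates the HSP hardware accelerator on the STM32U3C5 and checks whether a lightweight face detection model can be deployed within the resource limits of an MCU.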

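1. Test Objective

This test targets the HSP hardware accelerator on the STM32U3C5. It checks whether a lightweight face detection model can be deployed within the resource limits of an MCU, and compares the CPU backend with the HSP backend in terms of inference speed, output consistency and final detection boxes.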

2. Test Platform and Software Environment

Item	Details
MCU	STM32U3C5
CPU	Arm Cortex-M33
Clock frequency	96 MHz
AI toolchain	ST Edge AI
Accelerator	HSP
Model	BlazeFace 128×128
Input format	128×128×3 float32 RGB

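3. Model Conversion

Install the ST Edge AI tool by clicking Next through the installer with the default options. Once it is installed, add the Utilities directory to the PATH environment variable: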

$Env:Path += ";F:\Users\legen\AppData\Local\Programs\STEdgeAI\4.0\Utilities\windows"

Verify the installation:

PS: > stedgeai --version
ST Edge AI Core v4.0.1-20581 7ed50de05
   ISPU 2.0.1-RC2
   MLC 1.2.4-RC2
   StellarAI 4.0.1-RC2
   STM32CubeAI 12.0.1-RC2

This test uses a lightweight FaceDetect model. The model input is a 128×128 RGB image of type float32, so the input data size is:

128 × 128 × 3 × 4 bytes = 196,608 bytes

Basic model information:

Item	Value
Input size	128×128×3
Input type	float32

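Download the model from st-model-zoo and convert it with stedgeai. Two versions are generated here, one for HSP and one for CPU, so that accuracy and performance can be compared later: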

# hsp
stedgeai.exe generate -m blazeface_front_128_int8.tflite --target stm32u3 --c-api st-ai -O time --hsp 4096
# cpu
stedgeai.exe generate -m blazeface_front_128_int8.tflite --target stm32u3 --c-api st-ai -O time

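Statistics after conversion:

Item	Value
MACC	31,849,356
Weights size	107,956 B
Output size	60,932 B
CPU backend nodes	110
HSP backend nodes	94
CPU activation	328,056 B
HSP activation	327,680 B

In terms of node count, the execution graph generated by ST Edge AI shrinks from 110 nodes in the CPU version to 94 nodes in the HSP backend version. Both versions have the same MACC count, 31,849,356.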

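4. Implementation

4.1 Data Preprocessing

The project is based on STM32U3-GettingStarted-HumanActivityRecognition. To make the test repeatable, the image and the anchors are first converted to C arrays. The conversion script is: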

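from pathlib import Path

from PIL import Image


def write_c(img: Image.Image, output: Path, source_name: str) -> None:
    # Expects a 128x128 image; convert() guards against RGBA or palette inputs.
    pixels = list(img.convert("RGB").getdata())
    floats: list[float] = []
    # Interleaved RGB, each channel scaled from 0..255 to 0.0..1.0.
    for r, g, b in pixels:
        floats.extend([r / 255.0, g / 255.0, b / 255.0])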

    output.parent.mkdir(parents=True, exist_ok=True)
    with output.open("w", encoding="utf-8", newline="\n") as f:
        f.write("/* Auto-generated by tools/image_to_c.py. Do not edit manually. */\n")
        f.write(f"/* Source image: {source_name} */\n")
        f.write('#include "facedetect_assets.h"\n\n')
        f.write("#if defined(__GNUC__)\n")
        f.write("#define FD_ALIGN(x) __attribute__((aligned(x)))\n")
        f.write("#define FD_SECTION(name) __attribute__((section(name)))\n")
        f.write("#else\n#define FD_ALIGN(x)\n#define FD_SECTION(name)\n#endif\n\n")
        f.write('const uint32_t g_fd_image_width FD_SECTION(".fd_image_rodata") = FD_INPUT_W;\n')
        f.write('const uint32_t g_fd_image_height FD_SECTION(".fd_image_rodata") = FD_INPUT_H;\n')
        f.write('const float g_fd_image_rgb128_f32[FD_IMAGE_FLOAT_COUNT] FD_ALIGN(32) FD_SECTION(".fd_image_rodata") = {\n')
        for i, v in enumerate(floats):
            f.write(f"{v:.8f}f,")
            if (i + 1) % 6 == 0:
                f.write("\n")
            else:
                f.write(" ")
        f.write("\n};\n")

Anchor generation:

from pathlib import Path


def generate_anchors() -> list[tuple[float, float]]:
    anchors: list[tuple[float, float]] = []
    for y in range(16):
        for x in range(16):
            for _ in range(2):
                anchors.append(((x + 0.5) / 16.0, (y + 0.5) / 16.0))
    for y in range(8):
        for x in range(8):
            for _ in range(6):
                anchors.append(((x + 0.5) / 8.0, (y + 0.5) / 8.0))
    assert len(anchors) == 896
    return anchors


def write_c(output: Path, anchors: list[tuple[float, float]]) -> None:
    output.parent.mkdir(parents=True, exist_ok=True)
    with output.open("w", encoding="utf-8", newline="\n") as f:
        f.write("/* Auto-generated by tools/generate_anchors.py. Do not edit manually. */\n")
        f.write('#include "facedetect_assets.h"\n\n')
        f.write("#if defined(__GNUC__)\n")
        f.write("#define FD_ALIGN(x) __attribute__((aligned(x)))\n")
        f.write("#define FD_SECTION(name) __attribute__((section(name)))\n")
        f.write("#else\n#define FD_ALIGN(x)\n#define FD_SECTION(name)\n#endif\n\n")
        f.write('const uint32_t g_fd_anchor_count FD_SECTION(".fd_anchor_rodata") = FD_ANCHOR_COUNT;\n')
        f.write('const fd_anchor_t g_fd_anchors[FD_ANCHOR_COUNT] FD_ALIGN(32) FD_SECTION(".fd_anchor_rodata") = {\n')
        for i, (cx, cy) in enumerate(anchors):
            f.write(f"  {{{cx:.9f}f, {cy:.9f}f}},")
            f.write("\n" if (i + 1) % 4 == 0 else " ")
        f.write("};\n")

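4.2 Flash and SRAM Partitioning

Because the input image, the anchor data and the activation buffer are all fairly large, Flash and SRAM are partitioned in the linker (ld) script.

Flash regions

The test image and the anchor data are placed in dedicated Flash regions:

Region	Address	Purpose
FD_IMAGE_FLASH	0x081B8000	128×128×3 float32 image array
FD_ANCHOR_FLASH	0x081F8000	896 anchors
FLASH	from 0x08000000	Program code, model weights, read-only data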

The actual run log:

CSV,fdbench,image_addr,0x081b8000,width,128,height,128,float_count,49152
CSV,fdbench,anchor_addr,0x081f8000,count,896

The image array contains 49,152 floats:

128 × 128 × 3 = 49,152

The anchor array contains 896 anchors, matching the two output scales of BlazeFace:

16×16×2 + 8×8×6 = 512 + 384 = 896
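To make the anchor ordering concrete, here is a minimal sketch that maps a flat anchor index back to its grid cell. It assumes the same y, x, repeat loop order as generate_anchors above; the names anchor_cell and anchor_center are illustrative and do not appear in the project.

```python
def anchor_cell(index: int) -> tuple[int, int, int, int]:
    """Map a flat anchor index (0..895) to (grid, x, y, k).

    Indices 0..511 belong to the 16x16 grid with 2 anchors per cell,
    indices 512..895 to the 8x8 grid with 6 anchors per cell.
    """
    if not 0 <= index < 896:
        raise ValueError("anchor index out of range")
    if index < 512:
        cell, k = divmod(index, 2)
        y, x = divmod(cell, 16)
        return 16, x, y, k
    cell, k = divmod(index - 512, 6)
    y, x = divmod(cell, 8)
    return 8, x, y, k


def anchor_center(index: int) -> tuple[float, float]:
    """Return the normalized (cx, cy) center of the anchor at this index."""
    grid, x, y, _ = anchor_cell(index)
    return (x + 0.5) / grid, (y + 0.5) / grid


if __name__ == "__main__":
    for i in (0, 511, 512, 895):
        print(i, anchor_cell(i), anchor_center(i))
```

This is why the post-processing code can index the second output scale with 512 + i: the 8×8 anchors simply follow the 16×16 anchors in the same array.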

Linker script settings:

  .fd_image_rodata : ALIGN(32)
  {
    . = ALIGN(32);
    __fd_image_flash_start__ = .;
    *(.fd_image_rodata)
    *(.fd_image_rodata*)
    . = ALIGN(32);
    __fd_image_flash_end__ = .;
  } >FD_IMAGE_FLASH
  ASSERT(SIZEOF(.fd_image_rodata) <= LENGTH(FD_IMAGE_FLASH), "FaceDetect image data overflow FD_IMAGE_FLASH")

  .fd_anchor_rodata : ALIGN(32)
  {
    . = ALIGN(32);
    __fd_anchor_flash_start__ = .;
    *(.fd_anchor_rodata)
    *(.fd_anchor_rodata*)
    . = ALIGN(32);
    __fd_anchor_flash_end__ = .;
  } >FD_ANCHOR_FLASH
  ASSERT(SIZEOF(.fd_anchor_rodata) <= LENGTH(FD_ANCHOR_FLASH), "FaceDetect anchor data overflow FD_ANCHOR_FLASH")
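The ASSERT lines catch overflow at link time, but a quick host-side estimate can help when choosing region sizes. The sketch below rests on two assumptions that the post does not state: that FD_IMAGE_FLASH extends up to the origin of FD_ANCHOR_FLASH, and that fd_anchor_t holds two floats (8 bytes). The exact ordering inside each section is up to the linker, so treat the numbers as estimates.

```python
ALIGN = 32


def align_up(n: int, a: int = ALIGN) -> int:
    """Round n up to the next multiple of a."""
    return (n + a - 1) // a * a


# Region origins taken from the Flash table in this post.
FD_IMAGE_FLASH = 0x081B8000
FD_ANCHOR_FLASH = 0x081F8000

# Assumption: FD_IMAGE_FLASH runs up to the start of FD_ANCHOR_FLASH.
image_region = FD_ANCHOR_FLASH - FD_IMAGE_FLASH

# width + height (two uint32), then the 32-byte aligned float32 array.
image_section = align_up(align_up(2 * 4) + 128 * 128 * 3 * 4)

# Assumption: fd_anchor_t is two floats (cx, cy), i.e. 8 bytes per anchor.
anchor_section = align_up(align_up(4) + 896 * 8)

if __name__ == "__main__":
    print(f"image:   {image_section} of {image_region} bytes")
    print(f"anchors: {anchor_section} bytes")
```

Under these assumptions the image section needs about 196,640 bytes of a 262,144-byte window, and the anchors need about 7,200 bytes.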

SRAM regions

The activation buffer is placed in the AI_RAM region of SRAM:

backend	activation address	activation size
HSP	0x2003F000	327,680 B
CPU	0x2003F000	328,056 B

HSP run log:

CSV,fdbench,activation_addr,0x2003f000,size,327680

CPU run log:

CSV,fdbench,activation_addr,0x2003f000,size,328056

Linker script settings:

  .ai_activations (NOLOAD) : ALIGN(32)
  {
    . = ALIGN(32);
    __ai_activations_start__ = .;
    KEEP(*(.ai_activations))
    KEEP(*(.ai_activations*))
    KEEP(*(.AI_RAM))
    KEEP(*(.AI_RAM*))
    . = ALIGN(32);
    __ai_activations_end__ = .;
  } >AI_RAM
  ASSERT(SIZEOF(.ai_activations) <= LENGTH(AI_RAM), "AI activations overflow AI_RAM")
  ASSERT(__ai_activations_end__ <= ORIGIN(AI_RAM) + LENGTH(AI_RAM), "AI activations exceed AI_RAM end")

  .HSP_DATA_BRAM :
  {
    __section_static_hsp_data_bram_start__ = .;
    *(HSP_DATA_BRAM)
    __section_static_hsp_data_bram_end__ = .;
  } >HSP_DATA_BRAM

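4.3 ST Edge AI Inference

The generated network.h shows that, like most neural network inference runtimes, the st-ai C API follows this general flow:

init -> set_input -> set_output -> infer -> get_output -> deinit

Following the example project, the face detection benchmark is implemented as shown here. In this code the flow maps to stai_network_init, stai_network_set_activations, stai_network_get_inputs and stai_network_get_outputs, stai_network_run, and stai_network_deinit: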

void FaceDetect_Bench_RunOnce(void)
{
  stai_network *net = (stai_network *)g_net_ctx;
  stai_return_code rc;
  stai_ptr activations[STAI_NETWORK_ACTIVATIONS_NUM] = { (stai_ptr)g_net_activations };
  stai_ptr inputs[STAI_NETWORK_IN_NUM] = { 0 };
  stai_ptr outputs[STAI_NETWORK_OUT_NUM] = { 0 };
  stai_size n_inputs = 0;
  stai_size n_outputs = 0;
  uint64_t sum_cycles = 0;
  uint32_t min_cycles = 0xFFFFFFFFu;
  uint32_t max_cycles = 0;
  uint32_t hclk_hz = HAL_RCC_GetHCLKFreq();

  printf("\r\n[FD] FaceDetect benchmark start\r\n");
  printf("CSV,fdbench,backend,%s\r\n", bench_backend_name());
  printf("CSV,fdbench,hclk_hz,%lu\r\n", (unsigned long)hclk_hz);
  printf("CSV,fdbench,nodes,%lu,macc,%lu,act_bytes,%lu,weights_bytes,%lu,input_bytes,%lu,output_bytes,%lu\r\n",
         (unsigned long)STAI_NETWORK_NODES_NUM,
         (unsigned long)STAI_NETWORK_MACC_NUM,
         (unsigned long)STAI_NETWORK_ACTIVATIONS_SIZE_BYTES,
         (unsigned long)STAI_NETWORK_WEIGHTS_SIZE_BYTES,
         (unsigned long)STAI_NETWORK_IN_SIZE_BYTES,
         (unsigned long)STAI_NETWORK_OUT_SIZE_BYTES);
  printf("CSV,fdbench,activation_addr,0x%08lx,size,%lu\r\n",
         (unsigned long)(uintptr_t)g_net_activations,
         (unsigned long)STAI_NETWORK_ACTIVATIONS_SIZE_BYTES);
  printf("CSV,fdbench,image_addr,0x%08lx,width,%lu,height,%lu,float_count,%lu\r\n",
         (unsigned long)(uintptr_t)g_fd_image_rgb128_f32,
         (unsigned long)g_fd_image_width,
         (unsigned long)g_fd_image_height,
         (unsigned long)FD_IMAGE_FLOAT_COUNT);
  printf("CSV,fdbench,anchor_addr,0x%08lx,count,%lu\r\n",
         (unsigned long)(uintptr_t)g_fd_anchors,
         (unsigned long)g_fd_anchor_count);

  memset(g_net_ctx, 0, sizeof(g_net_ctx));
  memset((void *)g_net_activations, 0, STAI_NETWORK_ACTIVATIONS_SIZE_BYTES);

  rc = stai_network_init(net);
  if (rc != STAI_SUCCESS)
  {
    printf("CSV,fdbench,error,stai_network_init,%ld\r\n", (long)rc);
    return;
  }

  rc = stai_network_set_activations(net, activations, STAI_NETWORK_ACTIVATIONS_NUM);
  if (rc != STAI_SUCCESS)
  {
    printf("CSV,fdbench,error,stai_network_set_activations,%ld\r\n", (long)rc);
    return;
  }

  rc = stai_network_get_inputs(net, inputs, &n_inputs);
  if ((rc != STAI_SUCCESS) || (n_inputs != STAI_NETWORK_IN_NUM) || (inputs[0] == 0))
  {
    printf("CSV,fdbench,error,stai_network_get_inputs,%ld,%lu\r\n", (long)rc, (unsigned long)n_inputs);
    return;
  }

  rc = stai_network_get_outputs(net, outputs, &n_outputs);
  if ((rc != STAI_SUCCESS) || (n_outputs != STAI_NETWORK_OUT_NUM))
  {
    printf("CSV,fdbench,error,stai_network_get_outputs,%ld,%lu\r\n", (long)rc, (unsigned long)n_outputs);
    return;
  }

#if FACEDETECT_ENABLE_NODE_TRACE
  rc = stai_network_set_callback(net, (stai_event_cb)fd_stai_node_trace_cb, NULL);
  if (rc != STAI_SUCCESS)
  {
    printf("CSV,fdbench,error,stai_network_set_callback,%ld\r\n", (long)rc);
    return;
  }
  printf("CSV,fdbench,node_trace,enabled,max_buffers,%lu\r\n",
         (unsigned long)FACEDETECT_NODE_TRACE_MAX_BUFFERS);
#else
  printf("CSV,fdbench,node_trace,disabled\r\n");
#endif

  printf("CSV,fdbench,io_addr,input0,0x%08lx,out0,0x%08lx,out1,0x%08lx,out2,0x%08lx,out3,0x%08lx\r\n",
         (unsigned long)(uintptr_t)inputs[0],
         (unsigned long)(uintptr_t)outputs[0],
         (unsigned long)(uintptr_t)outputs[1],
         (unsigned long)(uintptr_t)outputs[2],
         (unsigned long)(uintptr_t)outputs[3]);

  if (STAI_NETWORK_IN_1_SIZE != FD_IMAGE_FLOAT_COUNT)
  {
    printf("CSV,fdbench,error,input_size_mismatch,model,%lu,image,%lu\r\n",
           (unsigned long)STAI_NETWORK_IN_1_SIZE,
           (unsigned long)FD_IMAGE_FLOAT_COUNT);
    return;
  }

  dwt_counter_init();

  for (uint32_t i = 0; i < FACEDETECT_BENCH_WARMUP; ++i)
  {
    copy_flash_image_to_input((float *)inputs[0], STAI_NETWORK_IN_1_SIZE);
    printf("CSV,fdbench,run_begin,phase,warmup,index,%lu\r\n", (unsigned long)i);
    rc = stai_network_run(net, (stai_run_mode)0);
    printf("CSV,fdbench,run_end,phase,warmup,index,%lu,rc,%ld\r\n", (unsigned long)i, (long)rc);
    if (rc != STAI_SUCCESS)
    {
      printf("CSV,fdbench,error,warmup_run,%lu,%ld\r\n", (unsigned long)i, (long)rc);
      return;
    }
  }

  for (uint32_t i = 0; i < FACEDETECT_BENCH_ITERS; ++i)
  {
    copy_flash_image_to_input((float *)inputs[0], STAI_NETWORK_IN_1_SIZE);
    printf("CSV,fdbench,run_begin,phase,bench,index,%lu\r\n", (unsigned long)i);
    uint32_t t0 = dwt_counter_get();
    rc = stai_network_run(net, (stai_run_mode)0);
    uint32_t t1 = dwt_counter_get();
    uint32_t dt = dwt_counter_elapsed(t0, t1);
    printf("CSV,fdbench,run_end,phase,bench,index,%lu,rc,%ld,cycles,%lu\r\n",
           (unsigned long)i, (long)rc, (unsigned long)dt);

    if (rc != STAI_SUCCESS)
    {
      printf("CSV,fdbench,error,bench_run,%lu,%ld\r\n", (unsigned long)i, (long)rc);
      return;
    }

    sum_cycles += (uint64_t)dt;
    if (dt < min_cycles) min_cycles = dt;
    if (dt > max_cycles) max_cycles = dt;
  }

  uint32_t avg_cycles = (uint32_t)(sum_cycles / (uint64_t)FACEDETECT_BENCH_ITERS);

  printf("CSV,fdbench,result,backend,%s,iters,%lu,avg_cycles,%lu,min_cycles,%lu,max_cycles,%lu,",
         bench_backend_name(),
         (unsigned long)FACEDETECT_BENCH_ITERS,
         (unsigned long)avg_cycles,
         (unsigned long)min_cycles,
         (unsigned long)max_cycles);
  print_ms_from_cycles("avg_ms", avg_cycles, hclk_hz);
  printf(",");
  print_ms_from_cycles("min_ms", min_cycles, hclk_hz);
  printf(",");
  print_ms_from_cycles("max_ms", max_cycles, hclk_hz);
  printf("\r\n");

  for (uint32_t o = 0; o < STAI_NETWORK_OUT_NUM; ++o)
  {
    const float *p = (const float *)outputs[o];
    float s = 0.0f;
    uint32_t n = 16u;
    for (uint32_t i = 0; i < n; ++i)
    {
      s += p[i];
    }
    printf("CSV,fdbench,out_checksum,%lu,%ld\r\n", (unsigned long)o, (long)(s * 1000000.0f));
  }

#if FACEDETECT_ENABLE_DECODE
  (void)fd_postprocess(outputs);
#endif

  (void)stai_network_deinit(net);
}
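The benchmark depends on dwt_counter_elapsed and print_ms_from_cycles, whose bodies are not shown in this post. The following host-side sketch models what they plausibly do, assuming a 32-bit free-running cycle counter and modulo 2^32 subtraction; the function names here are illustrative.

```python
MASK32 = 0xFFFFFFFF


def dwt_elapsed(t0: int, t1: int) -> int:
    """Elapsed cycles between two 32-bit counter reads.

    Unsigned modulo-2^32 subtraction stays correct across a single
    counter wrap between t0 and t1.
    """
    return (t1 - t0) & MASK32


def cycles_to_ms(cycles: int, hclk_hz: int) -> float:
    """Convert a cycle count to milliseconds at the given core clock."""
    return cycles * 1000.0 / hclk_hz


def max_interval_s(hclk_hz: int) -> float:
    """Longest interval a 32-bit counter can measure without ambiguity."""
    return 2**32 / hclk_hz


if __name__ == "__main__":
    print(dwt_elapsed(0xFFFFFF00, 0x00000100))
    print(cycles_to_ms(101_803_366, 96_000_000))
    print(max_interval_s(96_000_000))
```

At 96 MHz a 32-bit counter wraps roughly every 44.7 s, so a single inference of about one second is well within range. The log values look truncated rather than rounded to two decimals, which may explain a difference of 0.01 ms against this calculation.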

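4.4 Post-processing

The model output consists of two groups of feature maps, corresponding to 512 anchors and 384 anchors.

After inference completes on the MCU, the output tensors are read and decoded, and NMS is applied to produce the final face boxes.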

static uint32_t fd_postprocess(const stai_ptr outputs[STAI_NETWORK_OUT_NUM])
{
  const float *boxes_512  = (const float *)outputs[0];
  const float *scores_512 = (const float *)outputs[1];
  const float *scores_384 = (const float *)outputs[2];
  const float *boxes_384  = (const float *)outputs[3];
  uint32_t cand_count = 0;
  uint32_t det_count = 0;
  const float score_thr = (float)FACEDETECT_SCORE_THR_MILLI / 1000.0f;
  const float nms_thr = (float)FACEDETECT_NMS_THR_MILLI / 1000.0f;

  if ((g_fd_anchor_count != FD_ANCHOR_COUNT) ||
      (boxes_512 == NULL) || (scores_512 == NULL) ||
      (scores_384 == NULL) || (boxes_384 == NULL))
  {
    printf("CSV,face,error,invalid_postprocess_input\r\n");
    return 0;
  }

  for (uint32_t i = 0; i < 512u; ++i)
  {
    fd_candidate_t c;
    if (fd_decode_one(&boxes_512[i * 16u], scores_512[i], &g_fd_anchors[i], &c, score_thr))
    {
      fd_insert_candidate(c, g_candidates, &cand_count);
    }
  }

  for (uint32_t i = 0; i < 384u; ++i)
  {
    fd_candidate_t c;
    if (fd_decode_one(&boxes_384[i * 16u], scores_384[i], &g_fd_anchors[512u + i], &c, score_thr))
    {
      fd_insert_candidate(c, g_candidates, &cand_count);
    }
  }

  /* Sort candidates by score descending. */
  for (uint32_t i = 0; i < cand_count; ++i)
  {
    uint32_t best = i;
    for (uint32_t j = i + 1u; j < cand_count; ++j)
    {
      if (g_candidates[j].score > g_candidates[best].score)
      {
        best = j;
      }
    }
    if (best != i)
    {
      fd_candidate_t tmp = g_candidates[i];
      g_candidates[i] = g_candidates[best];
      g_candidates[best] = tmp;
    }
  }

  for (uint32_t i = 0; (i < cand_count) && (det_count < FACEDETECT_MAX_DETECTIONS); ++i)
  {
    uint32_t keep = 1u;
    for (uint32_t j = 0; j < det_count; ++j)
    {
      if (fd_iou(&g_candidates[i], &g_detections[j]) > nms_thr)
      {
        keep = 0u;
        break;
      }
    }
    if (keep)
    {
      g_detections[det_count++] = g_candidates[i];
    }
  }

  printf("CSV,face,count,%lu,candidates,%lu,score_thr_milli,%lu,nms_thr_milli,%lu\r\n",
         (unsigned long)det_count,
         (unsigned long)cand_count,
         (unsigned long)FACEDETECT_SCORE_THR_MILLI,
         (unsigned long)FACEDETECT_NMS_THR_MILLI);

  for (uint32_t i = 0; i < det_count; ++i)
  {
    uint32_t score_milli = (uint32_t)(g_detections[i].score * 1000.0f + 0.5f);
    uint32_t x0_q10000 = (uint32_t)(g_detections[i].x0 * 10000.0f + 0.5f);
    uint32_t y0_q10000 = (uint32_t)(g_detections[i].y0 * 10000.0f + 0.5f);
    uint32_t x1_q10000 = (uint32_t)(g_detections[i].x1 * 10000.0f + 0.5f);
    uint32_t y1_q10000 = (uint32_t)(g_detections[i].y1 * 10000.0f + 0.5f);
    uint32_t x0_px = (uint32_t)(g_detections[i].x0 * (float)FD_INPUT_W + 0.5f);
    uint32_t y0_px = (uint32_t)(g_detections[i].y0 * (float)FD_INPUT_H + 0.5f);
    uint32_t x1_px = (uint32_t)(g_detections[i].x1 * (float)FD_INPUT_W + 0.5f);
    uint32_t y1_px = (uint32_t)(g_detections[i].y1 * (float)FD_INPUT_H + 0.5f);

    printf("CSV,face,%lu,score_milli,%lu,x0_q10000,%lu,y0_q10000,%lu,x1_q10000,%lu,y1_q10000,%lu,x0_px,%lu,y0_px,%lu,x1_px,%lu,y1_px,%lu\r\n",
           (unsigned long)i,
           (unsigned long)score_milli,
           (unsigned long)x0_q10000,
           (unsigned long)y0_q10000,
           (unsigned long)x1_q10000,
           (unsigned long)y1_q10000,
           (unsigned long)x0_px,
           (unsigned long)y0_px,
           (unsigned long)x1_px,
           (unsigned long)y1_px);
  }

  return det_count;
}
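The helpers fd_decode_one and fd_iou are not shown in the post. For checking the firmware output on a host, a rough Python counterpart might look like the sketch below. The decode layout is an assumption based on the common MediaPipe-style BlazeFace convention (the first four values per anchor are dx, dy, w, h in input pixels, and the score is a sigmoid of the raw logit); the project's actual decoding may differ.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    score: float
    x0: float
    y0: float
    x1: float
    y1: float


def decode_one(raw, raw_score, anchor, score_thr, input_size=128.0):
    """Decode one anchor; return None if the score is below threshold.

    Assumed layout (not confirmed by the post): raw[0:4] = dx, dy, w, h
    in input pixels, relative to the anchor center.
    """
    clipped = max(-80.0, min(80.0, raw_score))  # avoid overflow in exp()
    score = 1.0 / (1.0 + math.exp(-clipped))
    if score < score_thr:
        return None
    cx = anchor[0] + raw[0] / input_size
    cy = anchor[1] + raw[1] / input_size
    w = raw[2] / input_size
    h = raw[3] / input_size
    return Box(score, cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))
    iy = max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))
    inter = ix * iy
    union = (a.x1 - a.x0) * (a.y1 - a.y0) + (b.x1 - b.x0) * (b.y1 - b.y0) - inter
    return inter / union if union > 0.0 else 0.0


def nms(cands, nms_thr, max_det):
    """Greedy NMS: keep boxes in descending score order unless they overlap
    an already kept box by more than nms_thr."""
    kept = []
    for c in sorted(cands, key=lambda b: b.score, reverse=True):
        if len(kept) >= max_det:
            break
        if all(iou(c, k) <= nms_thr for k in kept):
            kept.append(c)
    return kept
```

The nms function mirrors the sort and suppression loops of fd_postprocess, with sorted() replacing the in-place selection sort. With the thresholds from the logs, a call would look like nms(candidates, 0.3, max_det), where max_det stands in for FACEDETECT_MAX_DETECTIONS.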

5. Test Results

5.1 CPU Results

The CPU backend was run 5 times in a row, with stable results:

Metric	Value
Average cycles	101,803,366
Minimum cycles	101,803,333
Maximum cycles	101,803,405
Average time	1060.45 ms
Minimum time	1060.45 ms
Maximum time	1060.45 ms

Checksums of the 4 output tensors from the CPU backend:

CSV,fdbench,out_checksum,0,51526964
CSV,fdbench,out_checksum,1,-57252232
CSV,fdbench,out_checksum,2,-371083584
CSV,fdbench,out_checksum,3,98565056

5.2 HSP Results

HSP backend run log:

CSV,fdbench,backend,HSP
CSV,fdbench,hclk_hz,96000000
CSV,fdbench,nodes,94,macc,31849356,act_bytes,327680,weights_bytes,107956,input_bytes,196612,output_bytes,60932
CSV,fdbench,node_trace,disabled
CSV,fdbench,result,backend,HSP,iters,5,avg_cycles,44581011,min_cycles,44580779,max_cycles,44581176,avg_ms,464.38,min_ms,464.38,max_ms,464.38

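The HSP backend was also run 5 times in a row, with equally stable results:

Metric	Value
Average cycles	44,581,011
Minimum cycles	44,580,779
Maximum cycles	44,581,176
Average time	464.38 ms
Minimum time	464.38 ms
Maximum time	464.38 ms

Checksums of the 4 output tensors from the HSP backend: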

CSV,fdbench,out_checksum,0,51526964
CSV,fdbench,out_checksum,1,-57252232
CSV,fdbench,out_checksum,2,-371083584
CSV,fdbench,out_checksum,3,98565056

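The 4 output checksums of the HSP backend are identical to those of the CPU backend. Note that the checksum in FaceDetect_Bench_RunOnce sums only the first 16 floats of each output tensor, so this confirms agreement on that sampled portion for the current input; the identical final detection boxes in section 5.4 cover the decoded results.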
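Comparing the checksum lines by eye works for four values, but a small host-side script makes the check repeatable across runs. This is a minimal sketch that parses the CSV log format shown in this post; the function name parse_checksums is illustrative.

```python
def parse_checksums(log_text: str) -> dict[int, int]:
    """Extract {output_index: checksum} from 'CSV,fdbench,out_checksum' lines."""
    sums: dict[int, int] = {}
    for line in log_text.splitlines():
        fields = line.strip().split(",")
        if len(fields) == 5 and fields[:3] == ["CSV", "fdbench", "out_checksum"]:
            sums[int(fields[3])] = int(fields[4])
    return sums


CPU_LOG = """\
CSV,fdbench,out_checksum,0,51526964
CSV,fdbench,out_checksum,1,-57252232
CSV,fdbench,out_checksum,2,-371083584
CSV,fdbench,out_checksum,3,98565056
"""

HSP_LOG = """\
CSV,fdbench,backend,HSP
CSV,fdbench,out_checksum,0,51526964
CSV,fdbench,out_checksum,1,-57252232
CSV,fdbench,out_checksum,2,-371083584
CSV,fdbench,out_checksum,3,98565056
"""

if __name__ == "__main__":
    cpu = parse_checksums(CPU_LOG)
    hsp = parse_checksums(HSP_LOG)
    print("match" if cpu == hsp else "mismatch")
```

For a stricter comparison, the firmware could dump every output element rather than the first 16, and the same parser pattern would extend to that format.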

5.3 HSP vs. CPU Performance

Item	CPU backend	HSP backend
nodes	110	94
MACC	31,849,356	31,849,356
activation	328,056 B	327,680 B
Average cycles	101,803,366	44,581,011
Average time	1060.45 ms	464.38 ms
Output checksum	identical	identical
Detection boxes	identical	identical

The HSP speedup is computed as:

speedup = CPU avg time / HSP avg time
        = 1060.45 ms / 464.38 ms
        ≈ 2.28x

Reduction in inference time:

reduction = 1 - HSP avg time / CPU avg time
          = 1 - 464.38 / 1060.45
          ≈ 56.2%

So in this forward-only FaceDetect test, the HSP backend is about 2.28 times faster than the CPU backend, and inference time drops by about 56.2%.
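The same figures can be reproduced from the raw cycle counts, which avoids any rounding in the millisecond values. The sketch below also derives MACC per cycle as a rough throughput indicator; that metric is an addition for illustration and is not reported in the logs.

```python
CPU_CYCLES = 101_803_366
HSP_CYCLES = 44_581_011
MACC = 31_849_356


def speedup(cpu_cycles: int, hsp_cycles: int) -> float:
    """How many times faster the HSP run is than the CPU run."""
    return cpu_cycles / hsp_cycles


def reduction(cpu_cycles: int, hsp_cycles: int) -> float:
    """Fraction of inference time saved by the HSP backend."""
    return 1.0 - hsp_cycles / cpu_cycles


def macc_per_cycle(macc: int, cycles: int) -> float:
    """Average multiply-accumulate operations completed per core cycle."""
    return macc / cycles


if __name__ == "__main__":
    print(f"speedup:   {speedup(CPU_CYCLES, HSP_CYCLES):.2f}x")
    print(f"reduction: {reduction(CPU_CYCLES, HSP_CYCLES) * 100:.1f}%")
    print(f"CPU MACC/cycle: {macc_per_cycle(MACC, CPU_CYCLES):.3f}")
    print(f"HSP MACC/cycle: {macc_per_cycle(MACC, HSP_CYCLES):.3f}")
```

By this measure the CPU backend averages roughly 0.31 MACC per cycle and the HSP backend roughly 0.71, counted against core clock cycles for the whole network.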

5.4 Final Face Detection Results

On the test image, the CPU backend and the HSP backend both output 2 final detection boxes, and the results are identical:

CSV,face,count,2,candidates,5,score_thr_milli,750,nms_thr_milli,300
CSV,face,0,score_milli,864,x0_q10000,3104,y0_q10000,1543,x1_q10000,5069,y1_q10000,3507,x0_px,40,y0_px,20,x1_px,65,y1_px,45
CSV,face,1,score_milli,850,x0_q10000,4668,y0_q10000,3971,x1_q10000,7447,y1_q10000,6750,x0_px,60,y0_px,51,x1_px,95,y1_px,86

The detection boxes are:

Face	Score	Normalized Box	Pixel Box
0	0.864	x0=0.3104, y0=0.1543, x1=0.5069, y1=0.3507	(40,20)-(65,45)
1	0.850	x0=0.4668, y0=0.3971, x1=0.7447, y1=0.6750	(60,51)-(95,86)

Here q10000 is the normalized coordinate multiplied by 10000 and stored as an integer. For example:

x0_q10000 = 3104  =>  x0 = 0.3104

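For a 128×128 input image, pixel coordinates are the normalized coordinates scaled by 128 and rounded to the nearest integer:

x_px = round(x_norm × 128)
y_px = round(y_norm × 128)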
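As a quick cross-check of the log, the pixel columns can be recomputed from the q10000 columns. This sketch uses the same add 0.5 and truncate rounding as fd_postprocess; the function name q10000_to_px is illustrative.

```python
def q10000_to_px(q: int, size: int = 128) -> int:
    """Convert a q10000 coordinate to a pixel coordinate.

    Mirrors the firmware rounding: scale, add 0.5, truncate.
    """
    return int(q / 10000.0 * size + 0.5)


# q10000 values for the two faces, copied from the log in this post.
FACE0 = (3104, 1543, 5069, 3507)
FACE1 = (4668, 3971, 7447, 6750)

if __name__ == "__main__":
    print([q10000_to_px(q) for q in FACE0])
    print([q10000_to_px(q) for q in FACE1])
```

The firmware computes pixels from the unrounded float coordinate rather than from the q10000 value, so a coordinate that lands very close to a half-pixel boundary could in principle differ by one pixel from this recomputation. For the two boxes in this test the values agree.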

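6. Summary

This test deployed and measured a lightweight face detection model on the STM32U3C5 using the HSP. The results show that the STM32U3C5 can run a lightweight FaceDetect model with 128×128 input, and that the HSP backend and the CPU backend produce final detection boxes that are identical in count, coordinates and confidence. The HSP backend is about 2.28 times faster than the CPU backend, cutting inference time by about 56.2%.

Overall, the HSP on the STM32U3C5 effectively accelerates lightweight CNN inference while matching the CPU backend's detection results in this face detection test. For image-based edge AI on low-power MCUs, the HSP substantially reduces the forward-pass latency of the network (464.38 ms per inference here, a little over 2 inferences per second at 96 MHz). This gives a workable basis for follow-up work such as live camera detection, low-power intermittent wake-up detection and local AI pre-filtering.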