Performance profiling#

Reference: profiler & tensorboard_profiler

This recipe explains how to use the PyTorch profiler to measure the time and memory consumption of a model's operators.

import torch
import torchvision.models as models
from torch.profiler import profile, record_function, ProfilerActivity

Define a simple model and an input batch:

model = models.resnet18()
inputs = torch.randn(5, 3, 224, 224)

Profiling execution time#

The PyTorch profiler is enabled through a context manager and accepts a number of parameters; some of the most useful are listed below (a short sketch combining them follows):

  • activities (a list of activities to profile):

    • ProfilerActivity.CPU: PyTorch operators, TorchScript functions and user-defined code labels (see record_function below);

    • ProfilerActivity.CUDA: on-device CUDA kernels.

  • record_shapes: whether to record the shapes of operator inputs;

  • profile_memory: whether to report the amount of memory consumed by the model's tensors;

  • use_cuda: whether to measure the execution time of CUDA kernels.

Note

When using CUDA, the profiler also shows the CUDA runtime events occurring on the host.
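A minimal sketch combining the parameters above, assuming the CPU model and inputs defined earlier (the combination is illustrative, not required):

with profile(
    activities=[ProfilerActivity.CPU],
    record_shapes=True,     # record the input shapes of each operator
    profile_memory=True,    # report tensor memory allocated/released per operator
) as prof:
    model(inputs)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))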

Let's see how the profiler can be used to analyze execution time:

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("model_inference"):
        model(inputs)

Note

The record_function context manager can be used to label arbitrary code ranges with user-provided names (model_inference is used as the label in the example above).

The profiler lets you inspect which operators were called during the execution of a code range wrapped with the profiler context manager. If multiple profiler ranges are active at the same time (e.g. in parallel PyTorch threads), each profiling context manager tracks only the operators of its corresponding range. The profiler also automatically profiles asynchronous tasks launched with torch.jit._fork and, in the case of a backward pass, the backward operators launched with backward() calls.
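For example, a minimal sketch that labels two ranges in one run (the label names "preprocess" and "forward" are arbitrary choices here):

with profile(activities=[ProfilerActivity.CPU]) as prof:
    with record_function("preprocess"):   # first user-labeled range
        batch = inputs * 2.0
    with record_function("forward"):      # second user-labeled range
        model(batch)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))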

Print out the statistics for the execution above:

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                  model_inference         8.04%       5.637ms        99.93%      70.030ms      70.030ms             1  
                     aten::conv2d         0.17%     119.000us        72.91%      51.093ms       2.555ms            20  
                aten::convolution         0.43%     301.000us        72.74%      50.974ms       2.549ms            20  
               aten::_convolution         0.28%     194.000us        72.31%      50.673ms       2.534ms            20  
         aten::mkldnn_convolution        71.70%      50.246ms        72.03%      50.479ms       2.524ms            20  
                 aten::batch_norm         0.10%      67.000us        11.56%       8.100ms     405.000us            20  
     aten::_batch_norm_impl_index         0.22%     152.000us        11.46%       8.033ms     401.650us            20  
          aten::native_batch_norm        10.70%       7.500ms        11.20%       7.846ms     392.300us            20  
                 aten::max_pool2d         0.04%      27.000us         3.68%       2.579ms       2.579ms             1  
    aten::max_pool2d_with_indices         3.64%       2.552ms         3.64%       2.552ms       2.552ms             1  
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 70.079ms

Here we see that, as expected, most of the time is spent in convolution (and specifically in mkldnn_convolution, for PyTorch compiled with MKL-DNN support). Note the difference between self CPU time and CPU time: operators can call other operators; self CPU time excludes the time spent in child operator calls, while total CPU time includes it. You can sort by self CPU time instead by passing sort_by="self_cpu_time_total" to the table call.
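For example, sorting the same results by self CPU time:

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))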

To get finer-grained results and include operator input shapes, pass group_by_input_shape=True (note: this requires running the profiler with record_shapes=True):

print(prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_time_total", row_limit=10))
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                                                      Input Shapes  
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                  model_inference         8.04%       5.637ms        99.93%      70.030ms      70.030ms             1                                                                                []  
                     aten::conv2d         0.03%      22.000us        15.04%      10.543ms       2.636ms             4                             [[5, 64, 56, 56], [64, 64, 3, 3], [], [], [], [], []]  
                aten::convolution         0.08%      58.000us        15.01%      10.521ms       2.630ms             4                     [[5, 64, 56, 56], [64, 64, 3, 3], [], [], [], [], [], [], []]  
               aten::_convolution         0.06%      41.000us        14.93%      10.463ms       2.616ms             4     [[5, 64, 56, 56], [64, 64, 3, 3], [], [], [], [], [], [], [], [], [], [], []]  
         aten::mkldnn_convolution        14.77%      10.348ms        14.87%      10.422ms       2.606ms             4                             [[5, 64, 56, 56], [64, 64, 3, 3], [], [], [], [], []]  
                     aten::conv2d         0.03%      24.000us        13.72%       9.613ms       9.613ms             1                             [[5, 3, 224, 224], [64, 3, 7, 7], [], [], [], [], []]  
                aten::convolution         0.07%      46.000us        13.68%       9.589ms       9.589ms             1                     [[5, 3, 224, 224], [64, 3, 7, 7], [], [], [], [], [], [], []]  
               aten::_convolution         0.03%      22.000us        13.62%       9.543ms       9.543ms             1     [[5, 3, 224, 224], [64, 3, 7, 7], [], [], [], [], [], [], [], [], [], [], []]  
         aten::mkldnn_convolution        13.55%       9.499ms        13.59%       9.521ms       9.521ms             1                             [[5, 3, 224, 224], [64, 3, 7, 7], [], [], [], [], []]  
                     aten::conv2d         0.02%      15.000us        12.70%       8.900ms       2.967ms             3                            [[5, 512, 7, 7], [512, 512, 3, 3], [], [], [], [], []]  
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
Self CPU time total: 70.079ms

Note the occurrence of aten::convolution twice with different input shapes.

The profiler can also be used to analyze the performance of models executed on GPUs:

model = models.resnet18().cuda()
inputs = torch.randn(5, 3, 224, 224).cuda()

with profile(activities=[
        ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
    with record_function("model_inference"):
        model(inputs)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                        model_inference         0.14%       2.617ms       100.00%        1.876s        1.876s       0.000us         0.00%       3.562ms       3.562ms             1  
                                           aten::conv2d         0.01%      94.000us        94.85%        1.780s      88.977ms       0.000us         0.00%       2.897ms     144.850us            20  
                                      aten::convolution         0.01%     229.000us        94.85%        1.779s      88.972ms       0.000us         0.00%       2.897ms     144.850us            20  
                                     aten::_convolution         0.01%     143.000us        94.83%        1.779s      88.961ms       0.000us         0.00%       2.897ms     144.850us            20  
                                aten::cudnn_convolution         4.47%      83.840ms        94.83%        1.779s      88.954ms       2.897ms        81.33%       2.897ms     144.850us            20  
volta_scudnn_winograd_128x128_ldg1_ldg4_relu_tile148...         0.00%       0.000us         0.00%       0.000us       0.000us       1.288ms        36.16%       1.288ms     128.800us            10  
                                   volta_sgemm_64x64_nn         0.00%       0.000us         0.00%       0.000us       0.000us     536.000us        15.05%     536.000us     178.667us             3  
                                       aten::batch_norm         0.00%      46.000us         4.90%      91.996ms       4.600ms       0.000us         0.00%     353.000us      17.650us            20  
                           aten::_batch_norm_impl_index         0.01%      98.000us         4.90%      91.950ms       4.598ms       0.000us         0.00%     353.000us      17.650us            20  
                                 aten::cudnn_batch_norm         0.11%       2.017ms         4.90%      91.852ms       4.593ms     353.000us         9.91%     353.000us      17.650us            20  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.876s
Self CUDA time total: 3.562ms

Note

  • The first use of CUDA profiling may introduce extra overhead (a warm-up sketch follows below).

  • Note the on-device kernels that appear in the output (e.g. sgemm_64x64_nn).
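To keep that one-time overhead out of the measurements, a common approach is to run the model once before profiling. A minimal sketch, assuming the CUDA model and inputs above:

# Warm-up run outside the profiler so CUDA initialization and
# algorithm selection do not show up in the measurements
model(inputs)
torch.cuda.synchronize()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with record_function("model_inference"):
        model(inputs)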

Profiling memory consumption#

The PyTorch profiler can also show the amount of memory (used by the model's tensors) that was allocated (or released) during the execution of the model's operators. In the output below, 'self' memory corresponds to the memory allocated (released) by the operator itself, excluding child calls to other operators. Enable memory profiling by passing profile_memory=True.

model = models.resnet18()
inputs = torch.randn(5, 3, 224, 224)

with profile(activities=[ProfilerActivity.CPU],
        profile_memory=True, record_shapes=True) as prof:
    model(inputs)

print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))

print(prof.key_averages().table(sort_by="cpu_memory_usage", row_limit=10))
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls  
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                      aten::empty         0.83%     522.000us         0.83%     522.000us       2.610us      94.86 Mb      94.86 Mb           200  
    aten::max_pool2d_with_indices         5.94%       3.745ms         5.94%       3.745ms       3.745ms      11.48 Mb      11.48 Mb             1  
                      aten::addmm         0.42%     264.000us         0.45%     286.000us     286.000us      19.53 Kb      19.53 Kb             1  
                       aten::mean         0.06%      39.000us         0.24%     154.000us     154.000us      10.00 Kb      10.00 Kb             1  
              aten::empty_strided         0.01%       5.000us         0.01%       5.000us       5.000us           4 b           4 b             1  
                     aten::conv2d         0.15%      93.000us        77.98%      49.160ms       2.458ms      47.37 Mb           0 b            20  
                aten::convolution         0.44%     278.000us        77.83%      49.067ms       2.453ms      47.37 Mb           0 b            20  
               aten::_convolution         0.31%     197.000us        77.39%      48.789ms       2.439ms      47.37 Mb           0 b            20  
         aten::mkldnn_convolution        76.72%      48.370ms        77.08%      48.592ms       2.430ms      47.37 Mb           0 b            20  
                aten::as_strided_         0.05%      32.000us         0.05%      32.000us       1.600us           0 b           0 b            20  
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 63.045ms

---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls  
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                      aten::empty         0.83%     522.000us         0.83%     522.000us       2.610us      94.86 Mb      94.86 Mb           200  
                 aten::batch_norm         0.10%      62.000us        12.56%       7.920ms     396.000us      47.41 Mb           0 b            20  
     aten::_batch_norm_impl_index         0.19%     119.000us        12.46%       7.858ms     392.900us      47.41 Mb           0 b            20  
          aten::native_batch_norm        11.65%       7.342ms        12.22%       7.705ms     385.250us      47.41 Mb     -75.00 Kb            20  
                     aten::conv2d         0.15%      93.000us        77.98%      49.160ms       2.458ms      47.37 Mb           0 b            20  
                aten::convolution         0.44%     278.000us        77.83%      49.067ms       2.453ms      47.37 Mb           0 b            20  
               aten::_convolution         0.31%     197.000us        77.39%      48.789ms       2.439ms      47.37 Mb           0 b            20  
         aten::mkldnn_convolution        76.72%      48.370ms        77.08%      48.592ms       2.430ms      47.37 Mb           0 b            20  
                 aten::empty_like         0.10%      65.000us         0.23%     144.000us       7.200us      47.37 Mb           0 b            20  
                 aten::max_pool2d         0.01%       5.000us         5.95%       3.750ms       3.750ms      11.48 Mb           0 b             1  
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 63.045ms

Using tracing functionality#

Profiling results can be exported as a .json trace file:

model = models.resnet18().cuda()
inputs = torch.randn(5, 3, 224, 224).cuda()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(inputs)

prof.export_chrome_trace("trace.json")

You can examine the sequence of profiled operators and CUDA kernels in the Chrome trace viewer (chrome://tracing):

Examining stack traces#

The profiler can be used to analyze Python and TorchScript stack traces:

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    with_stack=True,
) as prof:
    model(inputs)

# Print aggregated stats
print(prof.key_averages(group_by_stack_n=5).table(sort_by="self_cuda_time_total", row_limit=2))
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -----------------------------------------------------------------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  Source Location                                                          
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -----------------------------------------------------------------------  
                                aten::cudnn_convolution        16.43%       2.415ms        74.21%      10.909ms     545.450us       2.942ms        81.41%       2.942ms     147.100us            20  runpy.py(75): _run_code                                                  
                                                                                                                                                                                                     site-packages/ipykernel_launcher.py(12): <module>                        
                                                                                                                                                                                                     site-packages/traitlets/config/application.py(974): launch_instance      
                                                                                                                                                                                                     site-packages/ipykernel/kernelapp.py(702): start                         
                                                                                                                                                                                                     site-packages/tornado/platform/asyncio.py(195): start                    
                                                                                                                                                                                                                                                                              
volta_scudnn_winograd_128x128_ldg1_ldg4_relu_tile148...         0.00%       0.000us         0.00%       0.000us       0.000us       1.323ms        36.61%       1.323ms     132.300us            10                                                                           
                                                                                                                                                                                                                                                                              
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -----------------------------------------------------------------------  
Self CPU time total: 14.701ms
Self CUDA time total: 3.614ms

Note the two convolutions and the two call sites in the torchvision/models/resnet.py script.

Visualizing data as a flamegraph#

Execution time (the self_cpu_time_total and self_cuda_time_total metrics) and stack traces can also be visualized as a flamegraph. To do this, first export the raw data using export_stacks (requires with_stack=True):

prof.export_stacks("/tmp/profiler_stacks.txt", "self_cuda_time_total")

Tip

We recommend using the FlameGraph tool to generate an interactive SVG, for example:

# git clone https://github.com/brendangregg/FlameGraph
# cd FlameGraph
# ./flamegraph.pl --title "CUDA time" --countname "us." /tmp/profiler_stacks.txt > perf_viz.svg

Profiling long-running jobs#

The PyTorch profiler offers an additional API for handling long-running jobs (such as training loops). Tracing the entire execution can be slow and produce very large trace files. To avoid this, use the optional arguments:

  • schedule: specifies a function that takes an integer argument (the step number) as input and returns an action for the profiler; the best way to use this parameter is with the torch.profiler.schedule helper function, which can generate a schedule for you;

  • on_trace_ready: specifies a function that takes a reference to the profiler as input and is called by the profiler each time a new trace is ready.

from torch.profiler import schedule

my_schedule = schedule(
    skip_first=10,  # ignore the first 10 steps entirely
    wait=5,         # in each cycle, idle for 5 steps,
    warmup=1,       # then profile 1 step but discard its results (warm-up),
    active=3,       # then record the next 3 steps;
    repeat=2)       # run the wait/warmup/active cycle twice

# Called by the profiler each time a new trace is ready
def trace_handler(p):
    output = p.key_averages().table(sort_by="self_cuda_time_total", row_limit=10)
    print(output)
    p.export_chrome_trace("/tmp/trace_" + str(p.step_num) + ".json")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(
        wait=1,
        warmup=1,
        active=2),
    on_trace_ready=trace_handler
) as p:
    for idx in range(8):
        model(inputs)
        p.step()
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                aten::cudnn_convolution        23.10%       4.721ms        29.95%       6.122ms     153.050us       5.930ms        80.85%       5.930ms     148.250us            40  
volta_scudnn_winograd_128x128_ldg1_ldg4_relu_tile148...         0.00%       0.000us         0.00%       0.000us       0.000us       2.639ms        35.98%       2.639ms     131.950us            20  
                                   volta_sgemm_64x64_nn         0.00%       0.000us         0.00%       0.000us       0.000us       1.094ms        14.91%       1.094ms     182.333us             6  
                                 aten::cudnn_batch_norm        10.14%       2.073ms        19.91%       4.070ms     101.750us     746.000us        10.17%     746.000us      18.650us            40  
void cudnn::winograd_nonfused::winogradForwardFilter...         0.00%       0.000us         0.00%       0.000us       0.000us     686.000us         9.35%     686.000us     114.333us             6  
volta_scudnn_128x32_sliced1x4_ldg4_relu_exp_medium_n...         0.00%       0.000us         0.00%       0.000us       0.000us     429.000us         5.85%     429.000us     107.250us             4  
void cudnn::bn_fw_tr_1C11_singleread<float, 512, tru...         0.00%       0.000us         0.00%       0.000us       0.000us     348.000us         4.74%     348.000us      11.600us            30  
                  volta_scudnn_128x64_relu_medium_nn_v1         0.00%       0.000us         0.00%       0.000us       0.000us     344.000us         4.69%     344.000us      86.000us             4  
                                       aten::clamp_min_         2.36%     482.000us         4.47%     914.000us      26.882us     281.000us         3.83%     281.000us       8.265us            34  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us     281.000us         3.83%     281.000us       8.265us            34  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 20.440ms
Self CUDA time total: 7.335ms

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                aten::cudnn_convolution        23.45%       3.215ms        30.18%       4.137ms     103.425us       5.790ms        80.60%       5.790ms     144.750us            40  
volta_scudnn_winograd_128x128_ldg1_ldg4_relu_tile148...         0.00%       0.000us         0.00%       0.000us       0.000us       2.578ms        35.89%       2.578ms     128.900us            20  
                                   volta_sgemm_64x64_nn         0.00%       0.000us         0.00%       0.000us       0.000us       1.061ms        14.77%       1.061ms     176.833us             6  
                                 aten::cudnn_batch_norm        10.26%       1.406ms        20.45%       2.803ms      70.075us     739.000us        10.29%     739.000us      18.475us            40  
void cudnn::winograd_nonfused::winogradForwardFilter...         0.00%       0.000us         0.00%       0.000us       0.000us     676.000us         9.41%     676.000us     112.667us             6  
volta_scudnn_128x32_sliced1x4_ldg4_relu_exp_medium_n...         0.00%       0.000us         0.00%       0.000us       0.000us     416.000us         5.79%     416.000us     104.000us             4  
void cudnn::bn_fw_tr_1C11_singleread<float, 512, tru...         0.00%       0.000us         0.00%       0.000us       0.000us     339.000us         4.72%     339.000us      11.300us            30  
                  volta_scudnn_128x64_relu_medium_nn_v1         0.00%       0.000us         0.00%       0.000us       0.000us     333.000us         4.64%     333.000us      83.250us             4  
                                       aten::clamp_min_         2.27%     311.000us         4.44%     608.000us      17.882us     275.000us         3.83%     275.000us       8.088us            34  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us     275.000us         3.83%     275.000us       8.088us            34  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 13.709ms
Self CUDA time total: 7.184ms
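As an alternative to a custom trace handler, torch.profiler.tensorboard_trace_handler can be passed as on_trace_ready to save traces for the TensorBoard profiler plugin referenced at the top of this recipe. A minimal sketch (the log directory ./log/resnet18 is an arbitrary choice):

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=2),
    # writes one trace file per profiling cycle into ./log/resnet18
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./log/resnet18")
) as p:
    for idx in range(8):
        model(inputs)
        p.step()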