追求性能第一部分：内联 C、OpenMP 和 Perl 数据语言 (PDL)- 技术经验 -卓越飞翔博客

追求性能第一部分：内联 c、openmp 和 perl 数据语言 (pdl)

有时，一个人的代码必须简单地执行，而诸如美观、“聪明”或对单一语言解决方案的承诺之类的原则则完全不适用。
在 tprc 我做了一个演讲（这里是幻灯片）关于如何做到这一点
可以针对生物信息学应用程序完成，但我认为有必要使用一个更简单的示例来说明最大化 perl 性能的潜在场所
程序员在数据密集型应用程序中工作时可以随意使用。

所以这是一个玩具问题来说明这些选项。给定一个非常大的双精度浮点数组使用以下函数将它们变换：cos(sin(sqrt(x)))。
该函数有 3 个嵌套的浮点运算。这是一个评估成本高昂的函数，尤其是在必须计算大量值的情况下。我们可以合理生成
使用以下代码快速获取 perl 中的数组值（以及我们将要检查的解决方案的一些副本）：

my $num_of_elements = 50_000_000;
my @array0 = map { rand } 1 .. $num_of_elements;    ## generate random numbers
my @array1 = @array0;                               ## copy the array
my @array2 = @array0;                               ## another copy
my @array3 = @array0;                               ## yet another copy
my @rray4  = @array0;                               ## the last? copy
my $array_in_pdl      = pdl(@array0);    ## convert the array to a pdl ndarray
my $array_in_pdl_copy = $array_in_pdl-&gt;copy;    ## copy the pdl ndarray

可能的解决方案包括以下：

在 perl 中使用 for 循环进行就地修改。

for my $elem (@array0) {
    $elem = cos( sin( sqrt($elem) ) );
}

使用内联 c 代码遍历数组并在 c 中就地转换。 。有效地使用 c 进行就地映射。在 c 中访问 perl 数组（c 中的 av*）的元素尤其如此
如果使用 perl 5.36 及更高版本，则性能更高，因为该版本的 perl 中引入了优化的获取函数。

void map_in_c(av *array) {
  int len = av_len(array) + 1;
  for (int i = 0; i 



<p><strong>使用内联 c 代码来转换数组，但将转换分解为 3 个连续的 c for 循环。</strong> 这是一个真正关于权衡的实验：现代 x86 处理器有一个专门的，<br>
向量化平方根指令，因此编译器也许可以弄清楚如何使用它来加速至少一部分计算。另一方面，我们将降低算术强度<br>
每个循环并访问相同的数据值两次，因此可能会为这些重复的<a style="color:#f60; text-decoration:underline;" href="https://www.php.cn/zt/35234.html" target="_blank">数据访问</a>付出代价。<br></p>

<pre class="brush:php;toolbar:false">void map_in_c_sequential(av *array) {
  int len = av_len(array) + 1;
  for (int i = 0; i 



<p><strong>使用 openmp 并行化 c 函数循环。</strong> 在上一篇文章中，我们讨论了如何从 perl 中控制 openmp 环境并编译 openmp 感知的 inline::c 代码<br>
由 perl 使用，所以让我们将这些知识付诸实践！在程序的 perl 方面，我们将这样做：<br></p>

<pre class="brush:php;toolbar:false">use v5.38;
use alien::openmp;
use openmp::environment;
use inline (
    c    =&gt; 'data',
    with =&gt; qw/alien::openmp/,
);
my $env = openmp::environment-&gt;new();
my $threads_or_workers = 8; ## or any other value
## modify number of threads and make c aware of the change
$env-&gt;omp_num_threads($threads_or_workers);
_set_num_threads_from_env();

## modify runtime schedule and make c aware of the change
$env-&gt;omp_schedule("guided,1");    ## modify runtime schedule
_set_openmp_schedule_from_env();

在程序的 c 部分，我们将执行此操作（已经讨论了 openmp 环境的辅助函数
之前，因此这里不再重复）。

#include <omp.h>
void map_in_c_using_omp(av *array) {
  int len = av_len(array) + 1;
#pragma omp parallel
  {
#pragma omp for schedule(runtime) nowait
    for (int i = 0; i 



<p><strong>perl 数据语言 (pdl) 可以拯救你。</strong> pdl 模块集是另一种加速操作的方法，可以将程序员从 c 语言中解救出来。它还能在给定正确指令的情况下自动并行化，所以为什么不使用它呢？<br></p>

<pre class="brush:php;toolbar:false">use pdl;
## set the minimum size problem for autothreading in pdl
set_autopthread_size(0);
my $threads_or_workers = 8; ## or any other value

## pdl
## use pdl to modify the array - multi threaded
set_autopthread_targ($threads_or_workers);
$array_in_pdl-&gt;inplace-&gt;sqrt;
$array_in_pdl-&gt;inplace-&gt;sin;
$array_in_pdl-&gt;inplace-&gt;cos;


## use pdl to modify the array - single thread
set_autopthread_targ(0);

$array_in_pdl_copy-&gt;inplace-&gt;sqrt;
$array_in_pdl_copy-&gt;inplace-&gt;sin;
$array_in_pdl_copy-&gt;inplace-&gt;cos;

使用8个线程我们得到这样的东西

inplace benchmarks
inplace  in         perl took 2.85 seconds
inplace  in perl/mapcseq took 1.62 seconds
inplace  in    perl/mapc took 1.54 seconds
inplace  in   perl/c/omp took 0.24 seconds

pdl benchmarks
inplace  in     pdl - st took 0.94 seconds
inplace  in     pdl - mt took 0.17 seconds

使用16个线程我们得到了这个！

Starting the benchmark for 50000000 elements using 16 threads/workers

Inplace benchmarks
Inplace  in         Perl took 3.00 seconds
Inplace  in Perl/mapCseq took 1.72 seconds
Inplace  in    Perl/mapC took 1.62 seconds
Inplace  in   Perl/C/OMP took 0.13 seconds

PDL benchmarks
Inplace  in     PDL - ST took 0.99 seconds
Inplace  in     PDL - MT took 0.10 seconds

一些观察：

openmp 和 pdl 的多线程 (mt) 会响应工作线程的数量，而解决方案则不会。因此，这些基准测试中纯 perl 和内联非 openmp 解决方案的时序给出了性能自然变化的想法
用 c 语言编写地图版本的代码，性能提高了约 180%（对比 perl 和 perl/mapc）。
在单线程中使用 pdl 性能提高了 285-300%（对比 pdl - st 和 perl 计时）。

总之，在 perl 中，有原生（pdl 模块）和外来（c/openmp）解决方案来加速数据密集型操作，那么为什么不广泛而明智地使用它们来提高 perl 程序的性能呢？

相关推荐