Threads slow each other down

Asked by: 小点点

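I have some expensive computations that I want to split up and distribute over a set of threads. I reduced my code to a minimal example in which the problem still occurs.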

In short:

I have N tasks that I want to split across "Threads" threads.

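Each task is the simple function below, which runs a bunch of simple math operations. (In practice I verify asymmetric signatures here, but I left that out to keep things simple.)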

while (i++ < 100000)
{
    for (int y = 0; y < 1000; y++)
    {
        sqrt(y);
    }
}

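Running the code above with 1 thread takes 0.36 seconds per operation (one iteration of the outermost for loop), so the total execution time is about 36 seconds.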

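Parallelization therefore looks like an obvious way to speed things up. With two threads, however, the time per operation rises to 0.72 seconds, which completely cancels out any gain.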

Adding more threads generally makes performance worse and worse.

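I have an Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz with 6 physical cores, so I would expect at least some improvement when going from 1 thread to 2. In practice, though, every operation gets slower as the number of threads increases.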

Am I doing something wrong?

Full code:

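#include <atomic>
#include <chrono>
#include <cmath>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <thread>
#include <vector>
#include <pthread.h>

using namespace std;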

const size_t N = 100;
const size_t Threads = 1;

atomic_int counter(0);

struct ThreadData
{
    int index;
    int count;

    ThreadData(const int index, const int count): index(index), count(count){};
};

void *executeSlave(void *threadarg)
{
    struct ThreadData *my_data;
    my_data = static_cast<ThreadData *>(threadarg);
    for( int x = my_data->index; x < my_data->index + my_data->count; x++ )
    {
        cout << "Thread: " << my_data->index <<  ": " << x << endl;

        clock_t start, end;
        start = clock();
        int i = 0;

        while (i++ < 100000)
        {
            for (int y = 0; y < 1000; y++)
            {
                sqrt(y);
            }
        }
        counter.fetch_add(1);

        end = clock();
        cout << end - start << ':' << CLOCKS_PER_SEC << ':' << (((float) end - start) / CLOCKS_PER_SEC)<< endl;
    }

    pthread_exit(NULL);
}

int main() 
{
    clock_t start, end;
    start = clock();

    pthread_t threads[Threads];
    vector<ThreadData> td;
    td.reserve(Threads);
    int each = N / Threads;
    cout << each << endl;
    for (int x = 0; x < Threads; x++) {
        cout << "main() : creating thread, " << x << endl;
        td[x] = ThreadData(x * each, each);

        int rc = pthread_create(&threads[x], NULL, executeSlave, (void *) &td[x]);

        if (rc) {
            cout << "Error:unable to create thread," << rc << endl;
            exit(-1);
        }
    }

    while (counter < N) {
        std::this_thread::sleep_for(10ms);
    }

    end = clock();

    cout << "Final:" << endl;
    cout << end - start << ':' << CLOCKS_PER_SEC << ':' << (((float) end - start) / CLOCKS_PER_SEC)
         << endl;

}

1 answer

Anonymous user

clock() returns the approximate CPU time used by the whole process.

Each iteration of the outermost loop does a fixed amount of work:

    int i = 0;
    while (i++ < 100000)
    {
        for (int y = 0; y < 1000; y++)
        {
            sqrt(y);
        }
    }

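The process CPU time reported around this loop is therefore proportional to the number of running threads: each thread still takes the same amount of time, but clock() adds up the CPU time of all threads that run concurrently.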

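Use std::chrono::steady_clock to measure wall-clock time instead. Also note that I/O such as std::cout takes a lot of wall-clock time and is erratic, so the measured total elapsed time will be skewed by the I/O inside the timed region.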

Some additional notes:

The return value of sqrt() is never used, so the compiler is free to eliminate the call entirely. To be safe, use the value in some way.

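void *executeSlave() never returns a void* value. Flowing off the end of a non-void function is UB; here that is avoided only because pthread_exit() does not return. Ending the function with return NULL; is clearer. With pthread_create the function has to keep the void *(void *) signature; if it really returns nothing, it can simply be declared void once it is run through std::thread.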

td.reserve(Threads) reserves memory but does not create any objects. td[x] then accesses an object that does not exist (UB). Use td.emplace_back(x * each, each) instead of td[x] = ....

Not technically a problem, but I recommend using standard C++ std::thread rather than pthread for better portability.

With the following code, I see the expected speedup, proportional to the number of threads:

#include <string>
#include <iostream>
#include <vector>
#include <atomic>
#include <chrono>
#include <cmath>
#include <thread>

using namespace std;
using namespace std::chrono_literals;

const size_t N = 12;
const size_t Threads = 2;

std::atomic<int> counter(0);
std::atomic<int> xx{ 0 };

void executeSlave(int index, int count, int n)
{
    double sum = 0;
    for (int x = index; x < index + count; x++)
    {
        cout << "Thread: " << index << ": " << x << endl;
        auto start = std::chrono::steady_clock::now();
        for (int i=0; i < 100000; i++)
        {
            for (int y = 0; y < n; y++)
            {
                sum += sqrt(y);
            }
        }
        counter++;

        auto end = std::chrono::steady_clock::now();
        cout << 1e-6 * (end - start) / 1us << " s" << endl;
    }
    xx += (int)sum; // prevent optimization

}

int main()
{
    std::thread threads[Threads];
    int each = N / Threads;
    cout << each << endl;
    auto start = std::chrono::steady_clock::now();
    for (int x = 0; x < Threads; x++) {
        cout << "main() : creating thread, " << x << endl;
        threads[x] = std::thread(executeSlave, x * each, each, 100);
    }

    for (auto& t : threads) {
        t.join();
    }

    auto end = std::chrono::steady_clock::now();

    cout << "Final:" << endl;
    cout << 1e-6 * (end - start) / 1us << " s" << endl;

}