Asked by: 小点点

Fastest way to check whether remote files (images) exist


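I have written a product synchronization script that runs between a local server hosting a merchant application and a remote web server hosting the store's eshop...

For the full synchronization option I need to sync about 5000 products with their images, etc... Even though size variations of the same product (different product sizes, e.g. shoes) share the same product image, I still need to check the existence of around 3500 images...

So, for the first run, I uploaded all the product images through FTP except for a couple of them, and let the script run to check whether it would upload those two missing images...

The problem is that the script ran for 4 hours, which is unacceptable... I mean, it did not re-upload every image... It only checked every single image to determine whether to skip it or upload it (through ftp_put()).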

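I was performing the check like this:

if (stripos(get_headers(DESTINATION_URL . "{$path}/{$file}")[0], '200 OK') === false) {

This is fairly fast, but obviously not fast enough for the sync to finish in a reasonable amount of time...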

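How do you handle situations like this, where you have to check the existence of a huge number of remote files?

As a last resort, I am left with using ftp_nlist() to download the remote file list and then writing an algorithm that more or less does a file diff between the local and remote files...

I tried it, and the recursive algorithm takes ages to build the file list, 30 minutes to be exact... You see, my files are not in a single folder... The whole tree spans 1956 folders, and the file list consists of 3653 product image files and growing... Also note that I did not even use the size "trick" (used together with ftp_nlist()) to determine whether an entry is a file or a folder; instead I used the newer ftp_mlsd(), which explicitly returns a type param holding that information... You can read more here: PHP FTP recursive directory listing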
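For reference, the listing-and-diff approach described above could be sketched roughly as follows. This is only an illustration under assumptions: `list_remote_files()` and `files_to_upload()` are hypothetical helper names, and `$listDir` is a hypothetical callable that returns entries shaped like the `ftp_mlsd()` result (arrays with at least `name` and `type` keys). With a real connection it could be something like `fn(string $d) => ftp_mlsd($ftp, $d)`.

```php
<?php
declare(strict_types=1);

/**
 * Walk a remote tree iteratively and return every file path found.
 * $listDir is an assumed callable returning ftp_mlsd()-style entries,
 * or false when a directory cannot be listed.
 */
function list_remote_files(callable $listDir, string $root): array
{
    $files = [];
    $pending = [rtrim($root, '/')];
    while ($pending !== []) {
        $dir = array_pop($pending);
        $entries = $listDir($dir === '' ? '/' : $dir);
        if (!is_array($entries)) {
            continue; // listing failed for this directory; skip it
        }
        foreach ($entries as $entry) {
            $type = strtolower((string)($entry['type'] ?? ''));
            $path = $dir . '/' . $entry['name'];
            if ($type === 'dir') {
                $pending[] = $path; // visit this sub-directory later
            } elseif ($type === 'file') {
                $files[] = $path;
            }
            // 'cdir' and 'pdir' (the "." and ".." entries) are ignored
        }
    }
    return $files;
}

/** Return the local paths that do not appear in the remote listing. */
function files_to_upload(array $localPaths, array $remotePaths): array
{
    $remote = array_flip($remotePaths); // hash lookup instead of nested loops
    return array_values(array_filter(
        $localPaths,
        static fn(string $p): bool => !isset($remote[$p])
    ));
}
```

Note that this still costs one listing request per directory, so with 1956 folders the listing itself stays the slow part; the sketch only keeps the comparison step cheap.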


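1 Answer

Anonymous user

curl_multi is probably the fastest way. Unfortunately curl_multi is rather hard to use, so an example helps a lot, IMO. Checking URLs between two 1 Gbps dedicated servers in two different datacenters in Canada, this script checks about 3000 URLs per second using 500 concurrent TCP connections (and it could be made even faster by re-using the curl handles instead of opening and closing them).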

<?php
declare(strict_types=1);
$urls=array();
for($i=0;$i<100000;++$i){
    $urls[]="http://ratma.net/";
}
validate_urls($urls, 500, 1000, false, false);
// if return_fault_reason is false, then the return is a simple array of strings of urls that validated.
// otherwise it's an array with the url as the key containing  array(bool validated,int curl_error_code,string reason) for every url
function validate_urls(array $urls, int $max_connections, int $timeout_ms = 10000, bool $consider_http_300_redirect_as_error = true, bool $return_fault_reason = true) : array
{
    if ($max_connections < 1) {
        throw new InvalidArgumentException("max_connections MUST be >=1");
    }
    foreach ($urls as $key => $foo) {
        if (!is_string($foo)) {
            throw new \InvalidArgumentException("all urls must be strings!");
        }
        if (empty($foo)) {
            unset($urls[$key]); //?
        }
    }
    unset($foo);
    // DISABLED for benchmarking purposes: $urls = array_unique($urls); // remove duplicates.
    $ret = array();
    $mh = curl_multi_init();
    $workers = array();
    // curl handles are resources in PHP 7 but CurlHandle objects in PHP 8+, where an
    // (int) cast no longer yields a unique id, so derive the worker key accordingly.
    $handle_id = static function ($handle): int {
        return is_object($handle) ? spl_object_id($handle) : (int)$handle;
    };
    $work = function () use (&$ret, &$workers, &$mh, &$return_fault_reason, &$consider_http_300_redirect_as_error, $handle_id) {
        // > If an added handle fails very quickly, it may never be counted as a running_handle
        while (1) {
            curl_multi_exec($mh, $still_running);
            if ($still_running < count($workers)) {
                break;
            }
            curl_multi_select($mh, 10);
        }
        while (false !== ($info = curl_multi_info_read($mh))) {
            if ($info['msg'] !== CURLMSG_DONE) {
                continue;
            }
            $id = $handle_id($info['handle']);
            assert(isset($workers[$id]));
            $url = $workers[$id];
            if ($info['result'] !== CURLE_OK) {
                // transfer-level failure (DNS, connect, timeout, ...)
                if ($return_fault_reason) {
                    $ret[$url] = array(false, $info['result'], "curl_exec error " . $info['result'] . ": " . curl_strerror($info['result']));
                }
            } elseif (CURLE_OK !== ($err = curl_errno($info['handle']))) {
                if ($return_fault_reason) {
                    $ret[$url] = array(false, $err, "curl error " . $err . ": " . curl_strerror($err));
                }
            } else {
                $code = (string)curl_getinfo($info['handle'], CURLINFO_HTTP_CODE);
                if ($code[0] === "3") {
                    if ($consider_http_300_redirect_as_error) {
                        if ($return_fault_reason) {
                            $ret[$url] = array(false, -1, "got a http " . $code . " redirect, which is considered an error");
                        }
                    } else {
                        if ($return_fault_reason) {
                            $ret[$url] = array(true, 0, "got a http " . $code . " redirect, which is considered a success");
                        } else {
                            $ret[] = $url;
                        }
                    }
                } elseif ($code[0] === "2") {
                    if ($return_fault_reason) {
                        $ret[$url] = array(true, 0, "got a http " . $code . " code, which is considered a success");
                    } else {
                        $ret[] = $url;
                    }
                } else {
                    // all non-2xx and non-3xx are always considered errors (500 internal server error, 400 client error, 404 not found, etc.)
                    if ($return_fault_reason) {
                        $ret[$url] = array(false, -1, "got a http " . $code . " code, which is considered an error");
                    }
                }
            }
            // release the finished handle and free its worker slot
            curl_multi_remove_handle($mh, $info['handle']);
            unset($workers[$id]);
            curl_close($info['handle']);
        }
    };
    foreach ($urls as $url) {
        // block until a connection slot is free
        while (count($workers) >= $max_connections) {
            $work();
        }
        $neww = curl_init($url);
        if (!$neww) {
            trigger_error("curl_init() failed! probably means that max_connections is too high and you ran out of resources", E_USER_WARNING);
            if ($return_fault_reason) {
                $ret[$url] = array(false, -1, "curl_init() failed");
            }
            continue;
        }
        $workers[$handle_id($neww)] = $url;
        curl_setopt_array($neww, array(
            CURLOPT_NOBODY => 1,
            CURLOPT_SSL_VERIFYHOST => 0,
            CURLOPT_SSL_VERIFYPEER => 0,
            CURLOPT_TIMEOUT_MS => $timeout_ms
        ));
        curl_multi_add_handle($mh, $neww);
        //curl_multi_exec($mh, $unused_here); LIKELY TO BE MUCH SLOWER IF DONE IN THIS LOOP: TOO MANY SYSCALLS
    }
    while (count($workers) > 0) {
        //echo "WAITING FOR WORKERS TO BECOME 0!";
        //var_dump(count($workers));
        $work();
    }
    curl_multi_close($mh);
    return $ret;
}
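
To tie this back to the question, the result of validate_urls() could be mapped onto the local image list to decide what still needs uploading. The following is a minimal sketch under assumptions: `build_url_map()` and `paths_missing_remotely()` are hypothetical helpers, and the base URL and relative paths are placeholders, not values from the question.

```php
<?php
declare(strict_types=1);

/**
 * Build "URL => local relative path" pairs for a set of image paths.
 * $baseUrl and the relative paths are illustrative assumptions.
 */
function build_url_map(string $baseUrl, array $relativePaths): array
{
    $map = [];
    foreach ($relativePaths as $rel) {
        // encode each path segment so spaces etc. still yield a valid URL
        $segments = explode('/', ltrim($rel, '/'));
        $encoded = implode('/', array_map('rawurlencode', $segments));
        $map[rtrim($baseUrl, '/') . '/' . $encoded] = $rel;
    }
    return $map;
}

/** Given the URLs that validated, return the local paths still to upload. */
function paths_missing_remotely(array $urlMap, array $validatedUrls): array
{
    // drop every map entry whose URL validated; what remains is missing
    return array_values(array_diff_key($urlMap, array_flip($validatedUrls)));
}

// Possible usage together with validate_urls() from the answer above
// (assumes $localPaths holds the relative image paths):
//
//   $map = build_url_map('https://shop.example/images', $localPaths);
//   $ok  = validate_urls(array_keys($map), 100, 10000, true, false);
//   foreach (paths_missing_remotely($map, $ok) as $rel) {
//       // upload $rel, e.g. with ftp_put()
//   }
```

With `$return_fault_reason` set to false, validate_urls() returns a plain list of the URLs that validated, which is why a key diff is enough here. Whether redirects should count as errors depends on how the remote server is set up, so the `true` passed above is only a guess.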