
A Simple Multi-Request Web Crawler Using U++

suiw9 2024-11-07 13:23

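This article uses the U++ framework. See Getting Started with Ultimate++ for an introduction to the environment.

The U++ framework provides an HttpRequest class capable of asynchronous operation. In this example, we will use this capability to build a simple single-threaded web crawler that uses up to 60 parallel HTTP connections.

Designing the GUI

We will provide a simple GUI to display the crawling progress.

First, we design a simple GUI layout for our application. The GUI here is fairly simple, but using the layout designer is still worthwhile.

The layout contains three ArrayCtrl widgets, which are basically tables. We will use work to show the progress of individual HTTP requests, finished to show the results of HTTP requests that have ended, and, just for fun, path, which shows for any row of finished the "path" of URLs from the seed URL to the finished URL.

Now let's use this layout and set up a few things in code: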

#define LAYOUTFILE <GuiWebCrawler/GuiWebCrawler.lay>
#include <CtrlCore/lay.h>

struct WebCrawler : public WithCrawlerLayout<TopWindow> {
    WebCrawler();
};

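WebCrawler will be the main class of our application. The odd-looking #include before it "imports" the designed layout into the code: it defines the WithCrawlerLayout template class that represents our layout. By deriving from it, we add the work, finished and path ArrayCtrl widgets as member variables of WebCrawler. We finish the setup in the WebCrawler constructor: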

WebCrawler::WebCrawler()
{
    CtrlLayout(*this, "WebCrawler");
    work.AddColumn("URL");
    work.AddColumn("Status");
    finished.AddColumn("Finished");
    finished.AddColumn("Response");
    finished.WhenCursor = [=] { ShowPath(); };    // when cursor is changed in finished, 
                                                  // show the path
    finished.WhenLeftDouble = [=] { OpenURL(finished); };
    path.AddColumn("Path");
    path.WhenLeftDouble = [=] { OpenURL(path); }; // double-click opens url in browser
    total = 0;
    Zoomable().Sizeable();
}

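CtrlLayout is the WithCrawlerLayout method that places the widgets at their designed positions. The rest of the code sets up the lists with columns and connects some user actions on the widgets to the corresponding methods of WebCrawler (we will add those methods later).

Data model

Now, with the boring GUI stuff out of the way, we will focus on the fun part: the web crawler code. First, we need some structures to keep track of things: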

struct WebCrawler : public WithCrawlerLayout<TopWindow> {
    VectorMap<String, int> url;        // maps url to the index of source url
    BiVector<int>          todo;       // queue of url indices to process
    
    struct Work {                      // processing record
        HttpRequest http;              // request
        int         urli;              // url index
    };
    Array<Work>      http;             // work records
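    int64            total;            // total bytes downloaded

    void ExtractUrls(const String& html, int srci); // defined later in the article
    void ShowPath();
    void OpenURL(ArrayCtrl& a);

    WebCrawler();
    void Run();
};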

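VectorMap is a U++ container that can be thought of as a mix of an array and a map. It provides index-based access to keys and values, and a fast way to find the index of a key. We will use url to avoid duplicate requests for the same URL (the URL goes into the key), and we store the index of the "parent" URL as the value, so that we can later display the path from the seed URL.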
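The array-plus-map idea can be tried without U++. The sketch below is a standard-C++ stand-in, not the U++ implementation: the names UrlIndex, find and add are hypothetical, and unlike the article's code (which calls Find before Add), this add does the duplicate check itself.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for VectorMap<String, int>: keys keep their
// insertion order, so every key has a stable index.
struct UrlIndex {
    std::vector<std::string> keys;               // index -> url
    std::vector<int> parents;                    // index -> index of source url
    std::unordered_map<std::string, int> lookup; // url -> index

    // Returns the index of url, or -1 if it is not stored.
    int find(const std::string& url) const {
        auto it = lookup.find(url);
        return it == lookup.end() ? -1 : it->second;
    }

    // Adds url unless it is already known; returns its index either way.
    int add(const std::string& url, int parent) {
        int known = find(url);
        if (known >= 0)
            return known;
        int index = static_cast<int>(keys.size());
        keys.push_back(url);
        parents.push_back(parent);
        lookup.emplace(url, index);
        assert(keys.size() == parents.size()); // both arrays grow together
        return index;
    }
};
```

Because indices never change once assigned, a queue of plain integers is enough to refer to URLs, which is the property the crawler relies on.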

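Next, we need a queue of URLs to process. When extracting URLs from HTML, we put them into the url VectorMap. This means every URL has a unique index in url, so a queue of indices, todo, is all we need.

Finally, we need some buffer to hold the concurrent requests. The processing record Work simply combines an HttpRequest with a URL index (just so we know which URL is being processed). Array is a U++ container capable of storing objects that cannot be copied in any way.

Main loop

We have the data model, so let's start writing code. First, we ask the user for the seed URL: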

void WebCrawler::Run()
{   // query the seed url, then do the show
    String seed = "www.codeproject.com";            // predefined seed url
    if(!EditText(seed, "GuiWebSpider", "Seed URL")) // query the seed url
        return;
    todo.AddTail(0);                                // first url to process index is 0
    url.Add(seed, 0);                               // add to database

The seed is the first URL, so we know it will have index 0. We simply add it to url and todo. Now the real work begins:

Open();              // open the main window
while(IsOpen()) {    // run until user closes the window
    ProcessEvents(); // process GUI events

We run the loop until the user closes the window. We need to process GUI events in this loop. The rest of the loop handles the actual work:

while(todo.GetCount() && http.GetCount() < 60)
{ // we have something to do and have less than 60 active requests
    int i = todo.Head();                     // pop url index from the queue
    todo.DropHead();
    Work& w = http.Add();                    // create a new http request
    w.urli = i;                              // need to know source url index
    w.http.Url(url.GetKey(i))                // setup request url
          .UserAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:11.0) "
                     "Gecko/20100101 Firefox/11.0") // lie a little :)
          .Timeout(0);                       // asynchronous mode
    work.Add(url.GetKey(i));                 // show processed URL in GUI
    work.HeaderTab(0).SetText
      (Format("URL (%d)", work.GetCount())); // update list header
}

If there is something in todo and fewer than 60 requests are active, we add a new concurrent request.
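The scheduling rule itself is independent of HTTP. As a minimal sketch in standard C++ (StartPending and its parameters are hypothetical names, and plain integers stand in for the Work records), it amounts to draining a FIFO queue into the active set until a cap is reached:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Moves url indices from the pending queue into the active set until the
// queue is empty or `limit` requests are active. Returns how many started.
std::size_t StartPending(std::deque<int>& todo, std::vector<int>& active,
                         std::size_t limit)
{
    assert(limit > 0); // a limit of zero would never start anything
    std::size_t started = 0;
    while (!todo.empty() && active.size() < limit) {
        active.push_back(todo.front()); // oldest url first (FIFO)
        todo.pop_front();
        ++started;
    }
    return started;
}
```

Calling this once per iteration of the main loop keeps the active set topped up as earlier requests finish and are removed.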

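The next step is to progress all active HTTP requests. The HttpRequest class does this with its Do method. In non-blocking mode, this method tries to advance the connection request. All we need to do is call it for every active request and then read the status.

However, even though this could be done in an "active" mode without waiting for actual socket events, a well-behaved program should first wait until a socket can be written or read, to save system resources. U++ provides the SocketWaitEvent class exactly for this: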

SocketWaitEvent we; // we shall wait for something to happen to our request sockets
for(int i = 0; i < http.GetCount(); i++)
    we.Add(http[i].http);
we.Wait(10);        // wait at most 10ms (to keep GUI running)

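The only problem is that SocketWaitEvent waits only on sockets, while we also have a GUI to run. We work around this by limiting the wait to at most 10 ms (we know that by then at least a periodic timer event will have occurred, which ProcessEvents should handle).

With that out of the way, we can go on to process the requests: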

int i = 0;
while(i < http.GetCount()) {                       // scan through active requests
    Work& w = http[i];
    w.http.Do();                                   // run request
    String u = url.GetKey(w.urli);                 // get the url from index
    int q = work.Find(u);                          // find line of url in GUI work list
    if(w.http.InProgress()) {                      // request still in progress
        if(q >= 0)
            work.Set(q, 1, w.http.GetPhaseName()); // set GUI to inform user
                                                   // about request phase
        i++;
    }
    else { // request finished
        String html = w.http.GetContent();         // read request content
        total += html.GetCount();      // just keep track about total content length
        finished.Add(u, w.http.IsError() ? String().Cat() << w.http.GetErrorDesc()
                                         : String().Cat() << w.http.GetStatusCode()
                                           << ' ' << w.http.GetReasonPhrase()
                                           << " (" << html.GetCount() << " bytes)",
                     w.urli);          // GUI info about finished url status,
                                       // with url index as last parameter
        finished.HeaderTab(0).SetText(Format("Finished (%d)", finished.GetCount()));
        finished.HeaderTab(1).SetText(Format("Response (%` KB)", total >> 10));
        if(w.http.IsSuccess()) {       // request ended OK
            ExtractUrls(html, w.urli); // extract new urls
            Title(AsString(url.GetCount()) + " URLs found"); // update window title
        }
        http.Remove(i);                // remove from active requests
        work.Remove(q);                // remove from GUI list of active requests
    }
}

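This loop looks complicated, but most of the code is there to update the GUI. The HttpRequest class has a handy GetPhaseName method that describes what is going on in the request. InProgress is true until the request has finished (either successfully or with some kind of failure). If the request succeeded, we use ExtractUrls to look for new URLs in the HTML code.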
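Stripped of the GUI updates, the loop is a "progress each item, remove the finished ones" pass. The following standard-C++ sketch shows only that control flow; FakeRequest and PumpOnce are hypothetical and merely imitate the Do / InProgress pair by counting down steps.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical request that needs a fixed number of Do() calls to finish.
struct FakeRequest {
    int id;
    int steps_left;
    void Do() {
        assert(steps_left >= 0); // a request cannot have negative work left
        if (steps_left > 0)
            --steps_left;
    }
    bool InProgress() const { return steps_left > 0; }
};

// One pass over the active requests: advance each one and move the ids of
// finished requests to `done`. The index only advances when nothing was
// removed, because erase() shifts the next element into position i.
void PumpOnce(std::vector<FakeRequest>& active, std::vector<int>& done)
{
    std::size_t i = 0;
    while (i < active.size()) {
        active[i].Do();
        if (active[i].InProgress()) {
            ++i;
        } else {
            done.push_back(active[i].id);
            active.erase(active.begin() + static_cast<std::ptrdiff_t>(i));
        }
    }
}
```

Incrementing the index unconditionally would skip the element that slides into the freed slot, which is why the article's loop uses while with a conditional i++ rather than a for loop.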

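Getting new URLs

For simplicity, ExtractUrls is a very naive implementation: all we do is scan for "http://" or "https://" strings and then read the following characters that look like part of a URL: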

bool IsUrlChar(int c)
{// characters allowed
    return c == ':' || c == '.' || IsAlNum(c) || c == '_' || c == '%' || c == '/';
}

void WebCrawler::ExtractUrls(const String& html, int srci)
{// extract urls from html text and add new urls to database, srci is source url
    int q = 0;
    while(q < html.GetCount()) {
        int http = html.Find("http://", q); // .Find returns next position of pattern
        int https = html.Find("https://", q); // or -1 if not found
        q = min(http < 0 ? https : http, https < 0 ? http : https);
        if(q < 0) // not found
            return;
        int b = q;
        while(q < html.GetCount() && IsUrlChar(html[q]))
            q++;
        String u = html.Mid(b, q - b);
        if(url.Find(u) < 0) {             // do we know about this url?
            todo.AddTail(url.GetCount()); // add its (future) index to todo
            url.Add(u, srci);             // add it to main url database
        }
    }
}

We put all candidate URLs into url and todo, to be processed by the main loop.
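To experiment with the scan outside U++, here is a rough standard-C++ equivalent. It is a sketch under assumptions: ExtractUrlsStd and IsUrlCharStd are hypothetical names, the function returns the candidates instead of storing them, and it performs no deduplication.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Same character set as IsUrlChar in the article.
bool IsUrlCharStd(unsigned char c)
{
    return c == ':' || c == '.' || std::isalnum(c) || c == '_' || c == '%' || c == '/';
}

// Returns every substring that starts with "http://" or "https://" and
// continues for as long as IsUrlCharStd accepts the characters.
std::vector<std::string> ExtractUrlsStd(const std::string& html)
{
    std::vector<std::string> found;
    std::size_t q = 0;
    while (q < html.size()) {
        std::size_t http = html.find("http://", q);
        std::size_t https = html.find("https://", q);
        std::size_t b = std::min(http, https); // npos is the largest value
        if (b == std::string::npos)
            break;                             // no more candidates
        q = b;
        while (q < html.size() && IsUrlCharStd(static_cast<unsigned char>(html[q])))
            ++q;
        assert(q > b); // the prefix itself is made of url characters
        found.push_back(html.substr(b, q - b));
    }
    return found;
}
```

One consequence of the small character set is that a URL is cut at the first '-', '?', '=' or '&', so "http://example.com/x-y" is collected as "http://example.com/x". That is acceptable for a demo but worth knowing before reusing the approach.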

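Final touches

At this point, all the hard work is done. The rest of the code is just two convenience functions. One opens the URL when a row of the finished or path list is double-clicked: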

void WebCrawler::OpenURL(ArrayCtrl& a)
{
    String u = a.GetKey(); // read url from GUI list
    WriteClipboardText(u); // put it to clipboard
    LaunchWebBrowser(u);   // launch web browser
}

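(As a bonus, we also put the URL on the clipboard.)

The other function fills the path list to show the path from the seed URL to the URL selected in the finished list: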

void WebCrawler::ShowPath()
{   // shows the path from seed url to finished url
    path.Clear();
    if(!finished.IsCursor())
        return;
    int i = finished.Get(2);  // get the index of finished
    Vector<String> p;
    for(;;) {
        p.Add(url.GetKey(i)); // add url to the list
        if(i == 0)            // seed url added
            break;
        i = url[i];           // get parent url index
    }
    for(int i = p.GetCount() - 1; i >= 0; i--) // display in reverse order, with seed first
        path.Add(p[i]);
}

Here we use the "dual nature" of VectorMap to traverse from the child URL back to the seed using indices.
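The traversal only needs the parent indices, so it can be sketched with a plain vector. In this standard-C++ illustration, PathFromSeed is a hypothetical helper and parents plays the role of the values stored in url:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// parents[i] holds the index of the url on which url i was found;
// the seed has index 0. Returns the indices from the seed to `index`.
std::vector<int> PathFromSeed(const std::vector<int>& parents, int index)
{
    std::vector<int> path;
    for (;;) {
        assert(index >= 0 && static_cast<std::size_t>(index) < parents.size());
        path.push_back(index);
        if (index == 0) // reached the seed
            break;
        index = parents[index];
    }
    std::reverse(path.begin(), path.end()); // seed first
    return path;
}
```

The walk terminates because a URL is always added after the page it was found on, so each parent index is smaller than the child's index and the chain must reach 0.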

The only piece of code missing now is MAIN:

GUI_APP_MAIN
{
    WebCrawler().Run();
}

And there we have it: a simple parallel web crawler with a GUI in about 150 lines.
