Baidu AnyQ V -- understanding the logic of the FAQ service

Keywords: Linux Operation & Maintenance

  • When the services were started earlier, the solr service was started first and the faq service second.
  • Baidu AnyQ IV showed that the FAQ data set is managed entirely by solr.
  • So in anyq, the data and the model (logic control) are very loosely coupled.

This part therefore looks at the logic side.

1. run_server

1.1 locating the file

The faq service is started with ./build/run_server. First check what kind of file it is, using ls -l or ll to view the file details.

  • If you are not familiar with the linux file system, you can read another article: File type description in linux
  • run_server is a regular file and is executable. Opening it with vi ./run_server shows only garbled bytes because it is a compiled binary, so its execution logic cannot be read this way.
  • run_server lives in the build folder, which is produced by cmake && make. So check the CMakeLists.txt file (the entry point for cmake), where you can find the following:
    add_executable(demo_anyq_multi ${CMAKE_SOURCE_DIR}/demo/demo_anyq_multi.cpp)
    add_executable(demo_anyq ${CMAKE_SOURCE_DIR}/demo/demo_anyq.cpp)
    add_executable(run_server ${CMAKE_SOURCE_DIR}/demo/run_server.cpp)
    add_executable(annoy_index_build_tool ${CMAKE_SOURCE_DIR}/demo/annoy_index_build.cpp)
    add_executable(feature_dump_tool ${CMAKE_SOURCE_DIR}/demo/feature_dump.cpp)
    
    target_link_libraries(demo_anyq_multi ${LIBS_LIST})
    target_link_libraries(demo_anyq ${LIBS_LIST})
    target_link_libraries(run_server ${LIBS_LIST})
    target_link_libraries(annoy_index_build_tool ${LIBS_LIST})
    target_link_libraries(feature_dump_tool ${LIBS_LIST})
    
  • So each executable has a corresponding cpp file. The variable ${LIBS_LIST} is also defined in CMakeLists.txt and consists of a number of .a and .so files.
  • A .a file is an archive of multiple .o files used for static linking (STATIC mode). Several .a files can be linked together to produce one executable.
  • A .so file is a shared object used for dynamic linking, similar to a dll on windows. It is loaded at run time instead of being copied into the executable.

1.2 file execution logic in demo folder

First look at run_server.cpp. The code is as follows:

#include <glog/logging.h>
#include "server/http_server.h"
#include "common/utils.h"
#include "common/plugin_header.h"

int main(int argc, char* argv[]) {
    google::InitGoogleLogging(argv[0]);
    FLAGS_stderrthreshold = google::INFO;
    anyq::HttpServer server;
    std::string anyq_brpc_conf = "./example/conf/anyq_brpc.conf";
    if (server.init(anyq_brpc_conf) != 0) {
        FATAL_LOG("server init failed");
        return -1;
    }

    if (server.always_run() != 0) {
        FATAL_LOG("server run failed");
        return -1;
    }
    return 0;
}
  • First, an instance of the anyq::HttpServer class is declared. The class comes from the server/http_server.h file.
  • Second, the location of the configuration file, "./example/conf/anyq_brpc.conf", is specified and passed to that server instance for initialization. The contents of this file are as follows:
idle_timeout_sec : -1
max_concurrency : 8
port : 8999
server_conf_dir : "./example/conf/"
log_conf_file : "log.conf"
anyq_dict_conf_dir : "./example/conf/"
anyq_conf_dir: "./example/conf/"

preproc_plugin {
    name : "default_preproc"
    type : "AnyqPreprocessor"
}

postproc_plugin {
    name : "default_postproc"
    type : "AnyqPostprocessor"
}
  • Next, look at server/http_server.h. Its main content is the definition of HttpServer as a public class in the anyq namespace.
  • Its main job during initialization is to read the configuration just passed in and store it in the corresponding, more explicit configuration items: server_conf_dir, log_conf_file, anyq_dict_conf_dir, anyq_conf_dir, server_config and others.
  • This file also includes "server/http_service_impl.h".
  • Next, look at server/http_service_impl.h. As the file name suggests, it is the implementation of the http service, and its content is quite specific:
namespace anyq {
class HttpServiceImpl : public anyq::HttpService {
public:
    HttpServiceImpl();
    ~HttpServiceImpl();
    int init(const ServerConfig& server_config);
    int destroy();
    int normalize_input(brpc::Controller* cntl, Json::Value& parameters);
    // Question answering semantic retrieval
    void anyq(google::protobuf::RpcController* cntl_base,
            const HttpRequest*,
            HttpResponse*,
            google::protobuf::Closure* done);
    
    // solr data manipulation interface -- adding data
    void solr_insert(google::protobuf::RpcController* cntl_base,
            const HttpRequest*,
            HttpResponse*,
            google::protobuf::Closure* done);

    // solr data manipulation interface -- update data
    void solr_update(google::protobuf::RpcController* cntl_base,
            const HttpRequest*,
            HttpResponse*,
            google::protobuf::Closure* done);

    // solr data manipulation interface -- delete data
    void solr_delete(google::protobuf::RpcController* cntl_base,
            const HttpRequest*,
            HttpResponse*,
            google::protobuf::Closure* done);

    // solr data manipulation interface -- clearing the index library requires password verification
    void solr_clear(google::protobuf::RpcController* cntl_base,
            const HttpRequest*,
            HttpResponse*,
            google::protobuf::Closure* done);

private:
    // Pre-processing: convert the data (get/post) received by the server into the anyq input format
    ReqPreprocInterface* _preproc_plugin;
    // Post-processing: customize the output of anyq
    ReqPostprocInterface* _postproc_plugin;
    DISALLOW_COPY_AND_ASSIGN(HttpServiceImpl);
};

} // namespace anyq
#endif  // BAIDU_NLP_ANYQ_HTTP_SERVICE_IMPL_H
  • This file includes more headers, among them brpc, which was downloaded from github and compiled earlier. See: Detailed explanation of BRPC (I) - Overview
  • In addition, the following two headers could not be found at first, so use the find command to search for them:
    #include "http_service.pb.h"
    #include "anyq.pb.h"
    
  • find does locate the file, under include/config. Unlike the .h files above, which are also under include, the include folder of the anyq repo does not contain a config folder by default, so these headers should be generated during compilation. Checking CMakeLists.txt confirms that the folder is referenced there:
    [root@567b3aed2b1c AnyQ-master]$ find . -name "http_service.pb.h" -print
    ./include/config/http_service.pb.h
    
    SET(PROTO_INC ${CMAKE_SOURCE_DIR}/include/config)   # line 30
    ${CMAKE_SOURCE_DIR}/include/config  # line 60
    
  • Then, in the ./include/config folder, you do see the two missing header files
    [root@567b3aed2b1c config]# ls
    anyq.pb.h  http_service.pb.h
    
  • Take the anyq.pb.h file as an example, its contents are as follows:
    // Generated by the protocol buffer compiler.  DO NOT EDIT!
    // source: anyq.proto
    
    #ifndef PROTOBUF_anyq_2eproto__INCLUDED
    #define PROTOBUF_anyq_2eproto__INCLUDED
    
    #include <string>
    
    #include <google/protobuf/stubs/common.h>
    
    #if GOOGLE_PROTOBUF_VERSION < 3001000
    #error This file was generated by a newer version of protoc which is
    #error incompatible with your Protocol Buffer headers.  Please update
    #error your headers.
    #endif
    #if 3001000 < GOOGLE_PROTOBUF_MIN_PROTOC_VERSION
    #error This file was generated by an older version of protoc which is
    #error incompatible with your Protocol Buffer headers.  Please
    #error regenerate this file with a newer version of protoc.
    #endif
    
    #include <google/protobuf/arena.h>
    #include <google/protobuf/arenastring.h>
    #include <google/protobuf/generated_message_util.h>
    #include <google/protobuf/metadata.h>
    #include <google/protobuf/message.h>
    #include <google/protobuf/repeated_field.h>
    #include <google/protobuf/extension_set.h>
    #include <google/protobuf/unknown_field_set.h>
    // @@protoc_insertion_point(includes)
    
    namespace anyq {
    
    // Internal implementation detail -- do not call these.
    void protobuf_AddDesc_anyq_2eproto();
    void protobuf_InitDefaults_anyq_2eproto();
    void protobuf_AssignDesc_anyq_2eproto();
    void protobuf_ShutdownFile_anyq_2eproto();
    
  • The key hint is in the first lines: this file is generated automatically by the protocol buffer compiler, and its source file is anyq.proto.
  • The same is true for http_service.pb.h.
  • The two proto files themselves are part of the anyq repo. They are not generated during compilation.
  • Take http_service.proto as an example. I could not fully follow it and did not dig further:
    package anyq;	
    option cc_generic_services = true;	
    message HttpRequest {
    };	
    message HttpResponse {
    };	
    service HttpService {
            rpc anyq(HttpRequest) returns (HttpResponse);
            rpc solr_insert(HttpRequest) returns (HttpResponse);
            rpc solr_update(HttpRequest) returns (HttpResponse);
            rpc solr_delete(HttpRequest) returns (HttpResponse);
            rpc solr_clear(HttpRequest) returns (HttpResponse);
    };
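
The proto file above is short because it only declares the service interface. Both messages are empty, and with cc_generic_services = true protoc generates an abstract HttpService class whose five methods are the ones HttpServiceImpl overrides. The request data presumably travels through the controller (normalize_input takes a brpc::Controller*) rather than through the messages. The following is a minimal sketch of that pattern in plain C++, without protobuf or brpc. Controller, DemoServiceImpl and dispatch are hypothetical stand-ins, not AnyQ or brpc code, and the name-based dispatch only mimics how an HTTP path could be mapped to a method.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-ins: the real messages are empty protobuf types and
// the real controller is brpc::Controller.
struct HttpRequest {};
struct HttpResponse {};
struct Controller {
    std::string request_body;
    std::string response_body;
};

// Analog of the abstract class generated from "service HttpService".
class HttpService {
public:
    virtual ~HttpService() = default;
    virtual void anyq(Controller*, const HttpRequest*, HttpResponse*) = 0;
    virtual void solr_insert(Controller*, const HttpRequest*, HttpResponse*) = 0;
    virtual void solr_update(Controller*, const HttpRequest*, HttpResponse*) = 0;
    virtual void solr_delete(Controller*, const HttpRequest*, HttpResponse*) = 0;
    virtual void solr_clear(Controller*, const HttpRequest*, HttpResponse*) = 0;
};

// Analog of HttpServiceImpl: each handler reads from and writes to the controller.
class DemoServiceImpl : public HttpService {
public:
    void anyq(Controller* c, const HttpRequest*, HttpResponse*) override {
        c->response_body = "anyq:" + c->request_body;
    }
    void solr_insert(Controller* c, const HttpRequest*, HttpResponse*) override {
        c->response_body = "insert:" + c->request_body;
    }
    void solr_update(Controller* c, const HttpRequest*, HttpResponse*) override {
        c->response_body = "update:" + c->request_body;
    }
    void solr_delete(Controller* c, const HttpRequest*, HttpResponse*) override {
        c->response_body = "delete:" + c->request_body;
    }
    void solr_clear(Controller* c, const HttpRequest*, HttpResponse*) override {
        c->response_body = "clear";
    }
};

// Route a method name to the matching virtual method; returns false if unknown.
bool dispatch(HttpService& svc, const std::string& method, Controller* cntl) {
    assert(cntl != nullptr);
    using Handler = void (HttpService::*)(Controller*, const HttpRequest*, HttpResponse*);
    static const std::map<std::string, Handler> table = {
        {"anyq", &HttpService::anyq},
        {"solr_insert", &HttpService::solr_insert},
        {"solr_update", &HttpService::solr_update},
        {"solr_delete", &HttpService::solr_delete},
        {"solr_clear", &HttpService::solr_clear},
    };
    const auto it = table.find(method);
    if (it == table.end()) {
        return false;
    }
    HttpRequest req;
    HttpResponse resp;
    (svc.*(it->second))(cntl, &req, &resp);
    return true;
}
```

Calling dispatch(svc, "anyq", &cntl) on a DemoServiceImpl fills cntl.response_body, which is roughly the role the anyq handler plays when the server receives a question.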
    

1.3 using tools to view function calls

For background, see another article: Trace program calls using strace

The overall process is summarized as follows.
(The original container was started by docker run without the required parameters, so strace could not be used in it. A new container is therefore started just for tracing with strace.)

# Add the --privileged parameter to be able to use strace
$ docker run -itd --privileged --name anyq-trace -p 0.0.0.0:8876:8999 -p 0.0.0.0:8700:8900 anyq/base 

$ docker exec -it anyq-trace /bin/bash

$ cd /home/AnyQ-master/build/ 

# Start the solr service
$ sh solr_script/anyq_solr.sh solr_script/sample_docs
# Start the faq service
$ ./run_server
# Verify that the service responds correctly

# Then start tracking
$ strace ./run_server

strace prints a large amount of output.

Note that strace reports system calls rather than the program's own function calls, so the tools I found do not really meet my needs.

1.4 continue to study the CMakeLists.txt file

In CMakeLists.txt, two lines relate to run_server:

add_executable(run_server ${CMAKE_SOURCE_DIR}/demo/run_server.cpp)
target_link_libraries(run_server ${LIBS_LIST})
# run_server is the executable that is started later

As you can see, the function of add_executable is to build an executable target from the specified list of source files.

Alternatively, check the log information printed after executing ./run_server.

2. log information printed during execution

  1. The contents of the ./example/conf/./rank_weights file are as follows:

    jaccard_sim     0.2
    fluid_simnet_feature    0.8
    
  2. The ./wordseg_utf8 folder contains dictionaries.
    For example, strong_punc.dic contains punctuation marks, as follows:

    !
    . 
    !
    ;
    ;
    

    word.dic contains the 26 English letters in upper and lower case, digits, punctuation, etc.

  3. There is a term2id.dict dictionary in ./simnet. Its format is as follows:

    Herman·Hesse    1
    weifeng    2
    Miaoshan    3
    stick    4
    Horizontal angle    5
    Sticky rice noodles    6
    computer projector    7
    China Everbright International Limited    8
    Aicheng Travel Network    9
    Zhizi    10
    100 million mu    11
    Otorhinolaryngology    12
    Health and Family Planning Bureau    13
    Water collector    14
    Inner tube    15
    LUXURY    16
    Scrap crusher    17
    Weifang people's Hospital    18
    Sinan Mansions    19
    Fuhua    20
    IELTS test network    21
    
  4. About the log line term_retrieval.cpp:77] RAW: create solr q builder equal_solr_q_1 success: this message is emitted in term_retrieval.cpp. Following the code further leads to plugin_factory.h, which carries the comment: // generate a component instance according to the component type; instances you create must be destroyed by yourself, the factory is not responsible. The components here are in fact configuration items, so the next step is to look at all the configuration items.

  5. All conf files under /build/example/conf are listed below:
    analysis.conf

    name: "analysis_conf"
    
    analysis_method {
        name: "method_wordseg"
        type: "AnalysisWordseg"
        using_dict_name: "lac"
    }
    

    anyq_brpc.conf

    idle_timeout_sec : -1
    max_concurrency : 8
    port : 8999
    server_conf_dir : "./example/conf/"
    log_conf_file : "log.conf"
    anyq_dict_conf_dir : "./example/conf/"
    anyq_conf_dir: "./example/conf/"
    
    preproc_plugin {
        name : "default_preproc"
        type : "AnyqPreprocessor"
    }
    
    postproc_plugin {
        name : "default_postproc"
        type : "AnyqPostprocessor"
    }
    

    anyq.conf

    analysis_config: "analysis.conf"
    retrieval_config: "retrieval.conf"
    rank_config: "rank.conf"
    

    dict.conf

    name: "example_dict_conf"
    
    dict_config {
        name: "rank_weights"
        type: "String2FloatAdapter"
        path: "./rank_weights"
    }
    
    dict_config {
        name: "lac"
        type: "WordsegAdapter"
        path: "./wordseg_utf8"
    }
    
    dict_config{
        name: "fluid_simnet"
        type: "PaddleSimAdapter"
        path: "./simnet"
    }
    
  6. In rank.conf, top_result: 1 means that only the top-one candidate is needed as the final result, and the threshold is set with threshold : 0.5. Looking carefully at the output of the semantic matching stage, however, it is not limited to the n candidates with the highest probability; it lists all candidates with a probability greater than 0.5.

    rank.conf

    name : "test_rank"
    
    top_result: 1
    
    matching_config {
        name : "wordseg_process"
        type : "WordsegProcessor"
        using_dict_name: "lac"
        output_num : 0
        rough : false
    }
    
    matching_config {
        name: "fluid_simnet_feature"
        type: "PaddleSimilarity"
        using_dict_name: "fluid_simnet"
        output_num : 1
        rough : false
        query_feed_name: "left"
        cand_feed_name: "right"
        score_fetch_name: "cos_sim_0.tmp"
    }
    
    matching_config {
        name : "jaccard_sim"
        type : "JaccardSimilarity"
        output_num : 1
        rough : false
    }
    rank_predictor {
        type: "PredictLinearModel"
        using_dict_name: "rank_weights"
    }
    threshold : 0.5
    

  7. In the rough sorting stage, data retrieval returns 15 questions that contain terms from the query. This number is specified in retrieval.conf (num_result : 15). The engine to query, engine_name : "collection1", is also specified here. That is why, in Baidu AnyQ IV - solr data addition test, retrieval still came from collection1 even after the data in a different core had been replaced.
    retrieval.conf

    retrieval_plugin {
        name : "term_recall_1"
        type : "TermRetrievalPlugin"
        search_host : "127.0.0.1"
        search_port : 8900
        engine_name : "collection1"
        solr_result_fl : "id,question,answer"
        solr_q : {
            type : "EqualSolrQBuilder"
            name : "equal_solr_q_1"
            solr_field : "question"
            source_name : "question"
        }
        num_result : 15
    }
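
To make the retrieval.conf fields concrete, here is a minimal sketch of how they could map onto a Solr select request. It assumes the standard Solr /solr/<core>/select handler with the q, fl, rows and wt parameters. How EqualSolrQBuilder really composes q is not shown in the source, so the sketch simply queries solr_field with the raw text. RetrievalConf, url_encode and build_select_url are hypothetical names, not AnyQ code.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Values copied from retrieval.conf; the struct itself is hypothetical.
struct RetrievalConf {
    std::string search_host = "127.0.0.1";
    int search_port = 8900;
    std::string engine_name = "collection1";
    std::string solr_result_fl = "id,question,answer";
    std::string solr_field = "question";
    int num_result = 15;
};

// Percent-encode every byte except the RFC 3986 unreserved characters.
std::string url_encode(const std::string& s) {
    static const char kHex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : s) {
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += kHex[c >> 4];
            out += kHex[c & 0x0F];
        }
    }
    return out;
}

// Assumed URL shape: http://host:port/solr/<engine_name>/select?...
std::string build_select_url(const RetrievalConf& conf, const std::string& query_text) {
    assert(conf.num_result > 0);
    const std::string q = conf.solr_field + ":" + query_text;
    return "http://" + conf.search_host + ":" + std::to_string(conf.search_port) +
           "/solr/" + conf.engine_name + "/select" +
           "?q=" + url_encode(q) +
           "&fl=" + url_encode(conf.solr_result_fl) +
           "&rows=" + std::to_string(conf.num_result) +
           "&wt=json";
}
```

In this sketch engine_name is the only thing that decides which core is queried, which matches the observation above: data placed in any other core is never consulted unless engine_name is changed.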
    
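The rank_weights file from item 1 and the rank.conf settings from item 6 can be tied together with a small sketch. It assumes that PredictLinearModel computes a weighted sum of the feature scores (the source does not confirm this), then drops candidates below threshold and keeps at most top_result of the rest. The jaccard function is the standard set-based definition and is not taken from the AnyQ source. All names here (parse_weights, Candidate, rank_candidates) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Parse "name<whitespace>weight" pairs, the layout shown for rank_weights.
std::map<std::string, double> parse_weights(const std::string& text) {
    std::map<std::string, double> weights;
    std::istringstream in(text);
    std::string name;
    double value = 0.0;
    while (in >> name >> value) {
        weights[name] = value;
    }
    return weights;
}

// Standard Jaccard similarity of two term sets: |A ∩ B| / |A ∪ B|.
double jaccard(const std::set<std::string>& a, const std::set<std::string>& b) {
    if (a.empty() && b.empty()) {
        return 0.0;
    }
    std::size_t inter = 0;
    for (const auto& term : a) {
        inter += b.count(term);
    }
    return static_cast<double>(inter) / static_cast<double>(a.size() + b.size() - inter);
}

struct Candidate {
    std::string question;
    std::map<std::string, double> features;  // feature name -> score
    double score = 0.0;
};

// Assumed behaviour: weighted sum, threshold filter, then keep the best top_result.
std::vector<Candidate> rank_candidates(std::vector<Candidate> cands,
                                       const std::map<std::string, double>& weights,
                                       double threshold, std::size_t top_result) {
    assert(top_result > 0);
    std::vector<Candidate> kept;
    for (auto& c : cands) {
        c.score = 0.0;
        for (const auto& kv : weights) {
            const auto it = c.features.find(kv.first);
            if (it != c.features.end()) {
                c.score += kv.second * it->second;
            }
        }
        if (c.score >= threshold) {
            kept.push_back(c);
        }
    }
    std::sort(kept.begin(), kept.end(),
              [](const Candidate& x, const Candidate& y) { return x.score > y.score; });
    if (kept.size() > top_result) {
        kept.erase(kept.begin() + static_cast<std::ptrdiff_t>(top_result), kept.end());
    }
    return kept;
}
```

With the weights 0.2 and 0.8, a candidate scoring 0.5 on jaccard_sim and 0.9 on fluid_simnet_feature gets 0.82 and passes the 0.5 threshold, while one scoring 0.1 and 0.3 gets 0.26 and is dropped. Raising top_result would let every candidate above the threshold through.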

Posted by Chips on Wed, 10 Nov 2021 06:52:01 -0800