PaddleHub

LCQMC

class paddlehub.datasets.LCQMC(tokenizer: Union[BertTokenizer, CustomTokenizer], max_seq_len: int = 128, mode: str = 'train'):

A Large-scale Chinese Question Matching Corpus.


  • Args:
    • tokenizer(BertTokenizer or CustomTokenizer)

      Tokenizes the text and encodes the data in the form the model requires.

    • max_seq_len(int, optional, defaults to 128)

      The maximum length, in tokens, of the inputs passed to the selected module (for example, ernie or bert).

    • mode(str, optional, defaults to train)

      Selects which split of the dataset to load: train, dev or test.
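
A minimal sketch of how the three arguments above might be put together. The helpers `lcqmc_kwargs` and `load_lcqmc_splits` are hypothetical names introduced for illustration and are not part of PaddleHub. The sketch also assumes that a module such as `ernie_tiny` can be loaded with `hub.Module(name=..., task='text-matching')` and that it provides a `get_tokenizer()` method. Check both against the PaddleHub version you have installed.

```python
from typing import Any, Dict

# The three modes named in the argument list above.
VALID_MODES = ("train", "dev", "test")


def lcqmc_kwargs(tokenizer: Any, max_seq_len: int = 128, mode: str = "train") -> Dict[str, Any]:
    """Hypothetical helper: check the arguments and bundle them as keyword arguments."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    if max_seq_len <= 0:
        raise ValueError("max_seq_len must be a positive number of tokens")
    return {"tokenizer": tokenizer, "max_seq_len": max_seq_len, "mode": mode}


def load_lcqmc_splits(module_name: str = "ernie_tiny", max_seq_len: int = 128) -> Dict[str, Any]:
    """Hypothetical helper: build one LCQMC dataset per mode."""
    # Imported lazily so the validation helper above works without PaddleHub installed.
    import paddlehub as hub

    # Assumption: the module supports the text-matching task and exposes get_tokenizer().
    model = hub.Module(name=module_name, task="text-matching")
    tokenizer = model.get_tokenizer()
    return {
        mode: hub.datasets.LCQMC(**lcqmc_kwargs(tokenizer, max_seq_len, mode))
        for mode in VALID_MODES
    }
```

With PaddleHub installed, `load_lcqmc_splits()["train"]` would give the training split under these assumptions. Validating `mode` up front keeps a typo such as `'valid'` from reaching the dataset constructor.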