Table Structure Recognition
Training: export CUDA_VISIBLE_DEVICES=4,5,6,7 && cd src && source $(conda info --base)/etc/profile.d/conda.sh && conda activate adp && torchrun --rdzv-backend=c10d --rdzv_endpoint localhost:0 --nnodes=1 --nproc_per_node=4 -m main ++name=EXP_r18_e2_d4_adamw_mamba dataset=pubtabnet ++dataset.root_dir="/rxhui/lxj/download/PubTableNet/pubtabnet" model/model/backbone=imgcnn ++model.model.encoder.nlayer=2 ++model.model.decoder.nlayer=4 ++model.model.backbone.backbone._target_=torchvision.models.resnet34 ++model.model.backbone.output_channels=512 trainer/train/optimizer=adamw ++trainer.train.batch_size=32 ++trainer.valid.batch_size=16 ++trainer.mode="train"
Test (generate HTML results): export CUDA_VISIBLE_DEVICES=0,1,2,3 && cd src && source $(conda info --base)/etc/profile.d/conda.sh && conda activate adp && torchrun --rdzv-backend=c10d --rdzv_endpoint localhost:0 --nnodes=1 --nproc_per_node=4 -m main ++name=EXP_r18_e2_d4_adamw_mamba dataset=pubtabnet ++dataset.root_dir="/rxhui/lxj/download/PubTableNet/pubtabnet" model/model/backbone=imgcnn ++model.model.encoder.nlayer=2 ++model.model.decoder.nlayer=4 ++model.model.backbone.backbone._target_=torchvision.models.resnet18 ++model.model.backbone.output_channels=512 trainer/train/optimizer=adamw ++trainer.train.batch_size=32 ++trainer.valid.batch_size=16 ++trainer.mode="test" ++trainer.test.model=../EXP_r18_e2_d4_adamw_mamba/model/best.pt
Score (TEDS): cd src && python -m utils.teds -f "../experiments/EXP_mamba_Double/html_table_result.json" -s
1. Datasets
1. TableBank
TableBank is an image-based dataset for table detection and recognition. Since it covers two tasks, it consists of two parts. For table detection, it provides page images together with the bounding boxes of table regions. For table structure recognition, it provides page images and HTML tag sequences that represent the row/column layout and the type of each table cell. Because the dataset does not address text-content recognition, it contains neither the cell text nor text bounding boxes.
https://github.com/doc-analysis/TableBank
2. Marmot
https://www.icst.pku.edu.cn/cpdp/sjzy/index.htm
The Marmot dataset consists of a Chinese part and an English part. The Chinese pages were collected from more than 120 e-books across different subject areas provided by the Founder Apabi library, while the English pages come from the Citeseer website. The dataset is built from PDF files and stores each document layout as a tree whose leaves are characters, images, and paths and whose root is the whole page; internal nodes include text lines, paragraphs, tables, and so on.
3. PubTabNet
PubTabNet contains 568k table images, each annotated with the HTML of its table. More specifically, the annotations provide the table structure and the cell text, but no word-level text bounding boxes (the 2.0 release does include a cell-level bounding box for each non-empty cell, which is what the bbox-reading code below relies on).
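A minimal sketch of reading one annotation (this assumes the public PubTabNet 2.0 jsonl layout with filename/split/html fields; the repo's own loader may differ):
import json

# each line of the jsonl file is one annotated table image
with open("PubTabNet_2.0.0.jsonl") as f:
    ann = json.loads(next(f))

print(ann["filename"], ann["split"])            # image name, train/val split
print(ann["html"]["structure"]["tokens"][:8])   # structure tokens: '<thead>', '<tr>', '<td>', ...
cell = ann["html"]["cells"][0]
print(cell["tokens"])                           # text tokens of the first cell
print(cell.get("bbox"))                         # [x0, y0, x1, y1]; absent for empty cells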
4. FinTab
GitHub - Irene323/GFTE: A GCN-based table structure recognition method
The dataset collects 19 PDF files with more than 1,600 tables in total (see Table 2 of the original paper for the file breakdown). The files span 3,329 pages, of which 2,522 contain tables. To guarantee diversity of table types, FinTab includes not only basic tables but also special forms of varying difficulty, such as semi-ruled tables, cross-page tables, tables with merged cells, and tables with multi-line headers. FinTab contains 119,021 cells in total, of which 2,859 (2.4%) are merged cells.
5. SciTSR
https://github.com/Academic-Hammer/SciTSR
https://pan.baidu.com/s/11YHEGfGVF9cDxD2qj35kKw (password: 1234)
SciTSR is a large-scale dataset for training and testing table structure recognition models. It contains 15,000 tables in PDF format together with structure labels extracted from the LaTeX source files, split into 12,000 training samples and 3,000 test samples. A test subset containing only complex tables, called SciTSR-COMP, is also provided. The dataset can be downloaded from the links above.
2. Combining the tasks
// Use one generative model for all sub-tasks: merge the box-generation model and the structure-generation model into a single, unified generative task.
High-Performance Transformers for Table Structure Recognition Need Early Convolutions
An End-to-End Multi-Task Learning Model for Image-based Table Recognition (combines the two tasks)
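To make the unification concrete, here is a hypothetical joint target sequence (illustrative only; the tokens [bbox], bbox-N, and <eos> are taken from the encoding code below, while the interleaving order is an assumption): structure tokens and quantized box-coordinate tokens share one vocabulary, so a single autoregressive decoder can emit both.
# hypothetical joint target: structure tokens first, then the quantized
# coordinates of each non-empty cell as "bbox-<coord>" tokens
structure = ["<thead>", "<tr>", "<td>", "</td>", "</tr>", "</thead>"]
boxes = [[1, 4, 70, 13]]  # already resized into the 448-bin coordinate space

target = structure + ["[bbox]"]
for box in boxes:
    target += [f"bbox-{c}" for c in box]
target += ["<eos>"]
# ['<thead>', '<tr>', '<td>', '</td>', '</tr>', '</thead>',
#  '[bbox]', 'bbox-1', 'bbox-4', 'bbox-70', 'bbox-13', '<eos>']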
Table network
EncoderDecoder(
  (backbone): ImgLinearBackbone(
    (conv_proj): Conv2d(3, 512, kernel_size=(16, 16), stride=(16, 16))
  )
  (encoder): Encoder(
    (encoder): TransformerEncoder(
      (layers): ModuleList(
        (0-3): 4 x TransformerEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
          )
          (linear1): Linear(in_features=512, out_features=2048, bias=True)
          (dropout): Dropout(p=0.2, inplace=False)
          (linear2): Linear(in_features=2048, out_features=512, bias=True)
          (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.2, inplace=False)
          (dropout2): Dropout(p=0.2, inplace=False)
        )
      )
    )
  )
  (decoder): Decoder(
    (decoder): TransformerDecoder(
      (layers): ModuleList(
        (0-3): 4 x TransformerDecoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
          )
          (multihead_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
          )
          (linear1): Linear(in_features=512, out_features=2048, bias=True)
          (dropout): Dropout(p=0.2, inplace=False)
          (linear2): Linear(in_features=2048, out_features=512, bias=True)
          (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (norm3): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.2, inplace=False)
          (dropout2): Dropout(p=0.2, inplace=False)
          (dropout3): Dropout(p=0.2, inplace=False)
        )
      )
    )
  )
  (norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
  (token_embed): TokenEmbedding(
    (embedding): Embedding(891, 512, padding_idx=2)
  )
  (pos_embed): PositionEmbedding(
    (embedding): Embedding(1024, 512)
    (dropout): Dropout(p=0.2, inplace=False)
  )
  (generator): Linear(in_features=512, out_features=891, bias=True)
)
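A minimal sketch of how an equivalent model can be assembled from stock PyTorch modules (the head count nhead=8 and the patch-flattening order are assumptions; the printout above does not show them):
import torch
import torch.nn as nn

d_model, vocab_size, max_len = 512, 891, 1024

# the printout's ImgLinearBackbone: a single 16x16-patch projection
backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)

# 4 encoder / 4 decoder layers, FFN width 2048, dropout 0.2, as printed
enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, dim_feedforward=2048,
                                       dropout=0.2, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, dim_feedforward=2048,
                                       dropout=0.2, batch_first=True)
decoder = nn.TransformerDecoder(dec_layer, num_layers=4)

token_embed = nn.Embedding(vocab_size, d_model, padding_idx=2)
pos_embed = nn.Embedding(max_len, d_model)
generator = nn.Linear(d_model, vocab_size)   # back to the 891-token vocab

# forward sketch: 448x448 image -> 28x28 patches -> encoder memory -> decoder
img = torch.randn(1, 3, 448, 448)
memory = encoder(backbone(img).flatten(2).transpose(1, 2))    # (1, 784, 512)

tgt = torch.randint(0, vocab_size, (1, 10))                   # dummy token prefix
pos = torch.arange(tgt.size(1)).unsqueeze(0)
causal = torch.triu(torch.ones(10, 10, dtype=torch.bool), 1)  # no peeking ahead
logits = generator(decoder(token_embed(tgt) + pos_embed(pos), memory,
                           tgt_mask=causal))                  # (1, 10, 891)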
Box encoding
# 1. Read the bboxes from the dataset in order, keeping only valid boxes (x0 < x1 and y0 < y1)
bboxes = [
    entry["bbox"]
    for entry in obj[1]["cells"]
    if "bbox" in entry
    and entry["bbox"][0] < entry["bbox"][2]
    and entry["bbox"][1] < entry["bbox"][3]
]
[[1, 4, 76, 13], [92, 4, 167, 13], [342, 4, 369, 13], [409, 4, 473, 13], [410, 17, 447, 27], [465, 17, 482, 27], [323, 30, 349, 40], [364, 30, 387, 40], [400, 30, 425, 40], [433, 30, 457, 40], [0, 44, 33, 53], [92, 44, 208, 53], [320, 44, 352, 53], [360, 44, 392, 53], [408, 44, 417, 53], [440, 44, 450, 53], [469, 44, 478, 53], [92, 56, 241, 65], [320, 56, 352, 65], [360, 56, 392, 65], [410, 56, 415, 65], [440, 56, 450, 65], [469, 56, 478, 65], [92, 67, 182, 77], [320, 67, 352, 77], [360, 67, 392, 77], [408, 67, 417, 77], [442, 67, 448, 77], [469, 67, 478, 77], [92, 79, 245, 89], [320, 79, 352, 89], [360, 79, 392, 89], [410, 79, 415, 89], [442, 79, 448, 89], [469, 79, 478, 89], [92, 91, 225, 101], [320, 91, 352, 101], [360, 91, 392, 101], [410, 91, 415, 101], [442, 91, 448, 101], [469, 91, 478, 101], [92, 103, 156, 113], [320, 103, 352, 113], [360, 103, 392, 113], [410, 103, 415, 113], [442, 103, 448, 113], [469, 103, 478, 113], [92, 115, 178, 125], [320, 115, 352, 125], [410, 115, 415, 125], [469, 115, 478, 125], [92, 127, 242, 137], [360, 127, 392, 137], [442, 127, 448, 137], [469, 127, 478, 137], [92, 139, 207, 149], [360, 139, 392, 149], [442, 139, 448, 149], [469, 139, 478, 149], ...]
# 2. Resize every box to the 448-pixel target grid (the nested comprehension also flattens the 4-tuples into one flat coordinate list)
bboxes[:] = [
    i
    for entry in bboxes
    for i in bbox_augmentation_resize(entry, img_size, tgt_size)
]
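bbox_augmentation_resize itself is not shown in this note; a plausible reimplementation (an assumption: it linearly rescales the four coordinates from the original image size to tgt_size and rounds, treating img_size as a single side length) is:
def bbox_augmentation_resize(bbox, img_size, tgt_size):
    # hypothetical sketch: map [x0, y0, x1, y1] from an img_size-pixel image
    # into the tgt_size (here 448) coordinate grid
    scale = tgt_size / img_size
    return [round(c * scale) for c in bbox]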
# 3. The result is one flat coordinate sequence
[1, 4, 70, 13, 85, 4, 154, 13, 315, 4, 340, 13, 377, 4, 436, 13, 378, 17, 412, 27, 429, 17, 444, 27, 298, 30, 322, 41, 336, 30, 357, 41, 369, 30, 392, 41, 399, 30, 421, 41, 0, 45, 30, 54, 85, 45, 192, 54, 295, 45, 324, 54, 332, 45, 361, 54, 376, 45, 384, ...]
# 4. Tokenizing with the vocabulary yields one Encoding per table:
Encoding(num_tokens=710, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
import tokenizers as tk
from typing import List

def prepare_bbox_seq(seq: List[float]):
    # wrap the flat coordinate sequence as: [bbox] bbox-x0 bbox-y0 bbox-x1 bbox-y1 ... <eos>
    tmp = [f"bbox-{round(i)}" for i in seq]
    out = ["[bbox]"] + tmp + ["<eos>"]
    return out

vocab = tk.Tokenizer.from_file(str(cwd / cfg.vocab.dir))
bbox_list = [" ".join(prepare_bbox_seq(i["bbox"])) for i in batch]
label["bbox"] = vocab.encode_batch(bbox_list)
3. Experimental results
Comparison experiments
TEDS comparison with TableFormer (scores in %):
Model | Simple | Complex | Total | Params (M)
---|---|---|---|---
TableFormer | 98.50 | 95.00 | 96.75 | Not reported
Ours | 98.70 | 95.31 | 97.04 | 42.52
Ablations (TEDS as fractions):
Model | Simple | Complex | Total | Params (M) | Box mAP
---|---|---|---|---|---
resnet18 + Transformer (2-layer encoder) | 0.9831 | 0.9450 | 0.9645 | 28.70 |
resnet18 + Transformer (2-layer encoder) + Box-Decoder | 0.9846 | 0.9476 | 0.9665 | 42.75 |
resnet18 + mamba (2-layer encoder) + Box-Decoder | 0.9864 | 0.9477 | 0.9675 | 41.94 |
mambaout + mamba (2-layer encoder) + Box-Decoder | 0.9870 | 0.9531 | 0.9704 | 42.52 | 0.9013
mambaout + Vmamba (2-layer encoder) + Box-Decoder | 0.9875 | 0.9548 | 0.9715 | 47.76 | 0.9064