Table Structure Recognition
Training: export CUDA_VISIBLE_DEVICES=4,5,6,7 && cd src && source $(conda info --base)/etc/profile.d/conda.sh && conda activate adp && torchrun --rdzv-backend=c10d --rdzv_endpoint localhost:0 --nnodes=1 --nproc_per_node=4 -m main ++name=EXP_r18_e2_d4_adamw_mamba dataset=pubtabnet ++dataset.root_dir="/rxhui/lxj/download/PubTableNet/pubtabnet" model/model/backbone=imgcnn ++model.model.encoder.nlayer=2 ++model.model.decoder.nlayer=4 ++model.model.backbone.backbone._target_=torchvision.models.resnet34 ++model.model.backbone.output_channels=512 trainer/train/optimizer=adamw ++trainer.train.batch_size=32 ++trainer.valid.batch_size=16 ++trainer.mode="train"
Test (generate HTML results): export CUDA_VISIBLE_DEVICES=0,1,2,3 && cd src && source $(conda info --base)/etc/profile.d/conda.sh && conda activate adp && torchrun --rdzv-backend=c10d --rdzv_endpoint localhost:0 --nnodes=1 --nproc_per_node=4 -m main ++name=EXP_r18_e2_d4_adamw_mamba dataset=pubtabnet ++dataset.root_dir="/rxhui/lxj/download/PubTableNet/pubtabnet" model/model/backbone=imgcnn ++model.model.encoder.nlayer=2 ++model.model.decoder.nlayer=4 ++model.model.backbone.backbone._target_=torchvision.models.resnet18 ++model.model.backbone.output_channels=512 trainer/train/optimizer=adamw ++trainer.train.batch_size=32 ++trainer.valid.batch_size=16 ++trainer.mode="test" ++trainer.test.model=../EXP_r18_e2_d4_adamw_mamba/model/best.pt
Score (TEDS): cd src && python -m utils.teds -f "../experiments/EXP_mamba_Double/html_table_result.json" -s
1. Datasets
1. TableBank
TableBank is an image-based dataset for table detection and recognition. Since it covers two tasks, it consists of two parts. For table detection, it provides page images together with the bounding boxes of table regions. For table structure recognition, it provides page images and HTML tag sequences that represent the row/column layout and the type of each table cell. Because the dataset does not address text-content recognition, it contains neither the cell text nor text bounding boxes.
https://github.com/doc-analysis/TableBank
2. Marmot
https://www.icst.pku.edu.cn/cpdp/sjzy/index.htm
The Marmot dataset consists of a Chinese part and an English part. The Chinese pages were collected from more than 120 e-books across different subject areas provided by the Founder Apabi library, while the English pages come from the Citeseer website. The dataset is built from PDF files and stores each document layout as a tree whose leaves are characters, images, and paths and whose root is the whole page; internal nodes include text lines, paragraphs, tables, and so on.
3. PubTabNet
PubTabNet contains 568k table images, each annotated with the HTML of its table. More specifically, the annotations provide the table structure and the cell text, but no word-level text bounding boxes (the 2.0 release does include a cell-level bounding box for each non-empty cell, which is what the bbox-reading code below relies on).
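A minimal sketch of reading one annotation (this assumes the public PubTabNet 2.0 jsonl layout with filename/split/html fields; the repo's own loader may differ):
import json

# each line of the jsonl file is one annotated table image
with open("PubTabNet_2.0.0.jsonl") as f:
    ann = json.loads(next(f))

print(ann["filename"], ann["split"])            # image name, train/val split
print(ann["html"]["structure"]["tokens"][:8])   # structure tokens: '<thead>', '<tr>', '<td>', ...
cell = ann["html"]["cells"][0]
print(cell["tokens"])                           # text tokens of the first cell
print(cell.get("bbox"))                         # [x0, y0, x1, y1]; absent for empty cells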
4. FinTab
GitHub - Irene323/GFTE: A GCN-based table structure recognition method
The dataset collects 19 PDF files with more than 1,600 tables in total (see Table 2 of the original paper for the file breakdown). The files span 3,329 pages, of which 2,522 contain tables. To guarantee diversity of table types, FinTab includes not only basic tables but also special forms of varying difficulty, such as semi-ruled tables, cross-page tables, tables with merged cells, and tables with multi-line headers. FinTab contains 119,021 cells in total, of which 2,859 (2.4%) are merged cells.
5. SciTSR
https://github.com/Academic-Hammer/SciTSR
https://pan.baidu.com/s/11YHEGfGVF9cDxD2qj35kKw (password: 1234)
SciTSR is a large-scale dataset for training and testing table structure recognition models. It contains 15,000 tables in PDF format together with structure labels extracted from the LaTeX source files, split into 12,000 training samples and 3,000 test samples. A test subset containing only complex tables, called SciTSR-COMP, is also provided. The dataset can be downloaded from the links above.
2. Combining the tasks
// Use one generative model for all sub-tasks: merge the box-generation model and the structure-generation model into a single, unified generative task.
High-Performance Transformers for Table Structure Recognition Need Early Convolutions
An End-to-End Multi-Task Learning Model for Image-based Table Recognition (combines the two tasks)
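To make the unification concrete, here is a hypothetical joint target sequence (illustrative only; the tokens [bbox], bbox-N, and <eos> are taken from the encoding code below, while the interleaving order is an assumption): structure tokens and quantized box-coordinate tokens share one vocabulary, so a single autoregressive decoder can emit both.
# hypothetical joint target: structure tokens first, then the quantized
# coordinates of each non-empty cell as "bbox-<coord>" tokens
structure = ["<thead>", "<tr>", "<td>", "</td>", "</tr>", "</thead>"]
boxes = [[1, 4, 70, 13]]  # already resized into the 448-bin coordinate space

target = structure + ["[bbox]"]
for box in boxes:
    target += [f"bbox-{c}" for c in box]
target += ["<eos>"]
# ['<thead>', '<tr>', '<td>', '</td>', '</tr>', '</thead>',
#  '[bbox]', 'bbox-1', 'bbox-4', 'bbox-70', 'bbox-13', '<eos>']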
Table network
EncoderDecoder(
  (backbone): ImgLinearBackbone(
    (conv_proj): Conv2d(3, 512, kernel_size=(16, 16), stride=(16, 16))
  )
  (encoder): Encoder(
    (encoder): TransformerEncoder(
      (layers): ModuleList(
        (0-3): 4 x TransformerEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
          )
          (linear1): Linear(in_features=512, out_features=2048, bias=True)
          (dropout): Dropout(p=0.2, inplace=False)
          (linear2): Linear(in_features=2048, out_features=512, bias=True)
          (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.2, inplace=False)
          (dropout2): Dropout(p=0.2, inplace=False)
        )
      )
    )
  )
  (decoder): Decoder(
    (decoder): TransformerDecoder(
      (layers): ModuleList(
        (0-3): 4 x TransformerDecoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
          )
          (multihead_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
          )
          (linear1): Linear(in_features=512, out_features=2048, bias=True)
          (dropout): Dropout(p=0.2, inplace=False)
          (linear2): Linear(in_features=2048, out_features=512, bias=True)
          (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (norm3): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.2, inplace=False)
          (dropout2): Dropout(p=0.2, inplace=False)
          (dropout3): Dropout(p=0.2, inplace=False)
        )
      )
    )
  )
  (norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
  (token_embed): TokenEmbedding(
    (embedding): Embedding(891, 512, padding_idx=2)
  )
  (pos_embed): PositionEmbedding(
    (embedding): Embedding(1024, 512)
    (dropout): Dropout(p=0.2, inplace=False)
  )
  (generator): Linear(in_features=512, out_features=891, bias=True)
)
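A minimal sketch of how an equivalent model can be assembled from stock PyTorch modules (the head count nhead=8 and the patch-flattening order are assumptions; the printout above does not show them):
import torch
import torch.nn as nn

d_model, vocab_size, max_len = 512, 891, 1024

# the printout's ImgLinearBackbone: a single 16x16-patch projection
backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)

# 4 encoder / 4 decoder layers, FFN width 2048, dropout 0.2, as printed
enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, dim_feedforward=2048,
                                       dropout=0.2, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, dim_feedforward=2048,
                                       dropout=0.2, batch_first=True)
decoder = nn.TransformerDecoder(dec_layer, num_layers=4)

token_embed = nn.Embedding(vocab_size, d_model, padding_idx=2)
pos_embed = nn.Embedding(max_len, d_model)
generator = nn.Linear(d_model, vocab_size)   # back to the 891-token vocab

# forward sketch: 448x448 image -> 28x28 patches -> encoder memory -> decoder
img = torch.randn(1, 3, 448, 448)
memory = encoder(backbone(img).flatten(2).transpose(1, 2))    # (1, 784, 512)

tgt = torch.randint(0, vocab_size, (1, 10))                   # dummy token prefix
pos = torch.arange(tgt.size(1)).unsqueeze(0)
causal = torch.triu(torch.ones(10, 10, dtype=torch.bool), 1)  # no peeking ahead
logits = generator(decoder(token_embed(tgt) + pos_embed(pos), memory,
                           tgt_mask=causal))                  # (1, 10, 891)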
Box encoding
# 1. Read the bboxes from the dataset in order, keeping only valid boxes (x0 < x1 and y0 < y1)
bboxes = [
    entry["bbox"]
    for entry in obj[1]["cells"]
    if "bbox" in entry
    and entry["bbox"][0] < entry["bbox"][2]
    and entry["bbox"][1] < entry["bbox"][3]
]
[[1, 4, 76, 13], [92, 4, 167, 13], [342, 4, 369, 13], [409, 4, 473, 13], [410, 17, 447, 27], [465, 17, 482, 27], [323, 30, 349, 40], [364, 30, 387, 40], [400, 30, 425, 40], [433, 30, 457, 40], [0, 44, 33, 53], [92, 44, 208, 53], [320, 44, 352, 53], [360, 44, 392, 53], [408, 44, 417, 53], [440, 44, 450, 53], [469, 44, 478, 53], [92, 56, 241, 65], [320, 56, 352, 65], [360, 56, 392, 65], [410, 56, 415, 65], [440, 56, 450, 65], [469, 56, 478, 65], [92, 67, 182, 77], [320, 67, 352, 77], [360, 67, 392, 77], [408, 67, 417, 77], [442, 67, 448, 77], [469, 67, 478, 77], [92, 79, 245, 89], [320, 79, 352, 89], [360, 79, 392, 89], [410, 79, 415, 89], [442, 79, 448, 89], [469, 79, 478, 89], [92, 91, 225, 101], [320, 91, 352, 101], [360, 91, 392, 101], [410, 91, 415, 101], [442, 91, 448, 101], [469, 91, 478, 101], [92, 103, 156, 113], [320, 103, 352, 113], [360, 103, 392, 113], [410, 103, 415, 113], [442, 103, 448, 113], [469, 103, 478, 113], [92, 115, 178, 125], [320, 115, 352, 125], [410, 115, 415, 125], [469, 115, 478, 125], [92, 127, 242, 137], [360, 127, 392, 137], [442, 127, 448, 137], [469, 127, 478, 137], [92, 139, 207, 149], [360, 139, 392, 149], [442, 139, 448, 149], [469, 139, 478, 149], ...]
# 2. Resize every box to the 448-pixel target grid (the nested comprehension also flattens the 4-tuples into one flat coordinate list)
bboxes[:] = [
    i
    for entry in bboxes
    for i in bbox_augmentation_resize(entry, img_size, tgt_size)
]
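bbox_augmentation_resize itself is not shown in this note; a plausible reimplementation (an assumption: it linearly rescales the four coordinates from the original image size to tgt_size and rounds, treating img_size as a single side length) is:
def bbox_augmentation_resize(bbox, img_size, tgt_size):
    # hypothetical sketch: map [x0, y0, x1, y1] from an img_size-pixel image
    # into the tgt_size (here 448) coordinate grid
    scale = tgt_size / img_size
    return [round(c * scale) for c in bbox]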
# 3. The result is one flat coordinate sequence
[1, 4, 70, 13, 85, 4, 154, 13, 315, 4, 340, 13, 377, 4, 436, 13, 378, 17, 412, 27, 429, 17, 444, 27, 298, 30, 322, 41, 336, 30, 357, 41, 369, 30, 392, 41, 399, 30, 421, 41, 0, 45, 30, 54, 85, 45, 192, 54, 295, 45, 324, 54, 332, 45, 361, 54, 376, 45, 384, ...]
# 4. Tokenizing with the vocabulary yields one Encoding per table:
Encoding(num_tokens=710, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
import tokenizers as tk
from typing import List

def prepare_bbox_seq(seq: List[float]):
    # wrap the flat coordinate sequence as: [bbox] bbox-x0 bbox-y0 bbox-x1 bbox-y1 ... <eos>
    tmp = [f"bbox-{round(i)}" for i in seq]
    out = ["[bbox]"] + tmp + ["<eos>"]
    return out

vocab = tk.Tokenizer.from_file(str(cwd / cfg.vocab.dir))
bbox_list = [" ".join(prepare_bbox_seq(i["bbox"])) for i in batch]
label["bbox"] = vocab.encode_batch(bbox_list)
3. Experimental results
Comparison experiments
TEDS comparison with TableFormer (scores in %):
Model | Simple | Complex | Total | Params (M)
---|---|---|---|---
TableFormer | 98.50 | 95.00 | 96.75 | Not reported
Ours | 98.70 | 95.31 | 97.04 | 42.52
Ablations (TEDS as fractions):
Model | Simple | Complex | Total | Params (M) | Box mAP
---|---|---|---|---|---
resnet18 + Transformer (2-layer encoder) | 0.9831 | 0.9450 | 0.9645 | 28.70 |
resnet18 + Transformer (2-layer encoder) + Box-Decoder | 0.9846 | 0.9476 | 0.9665 | 42.75 |
resnet18 + mamba (2-layer encoder) + Box-Decoder | 0.9864 | 0.9477 | 0.9675 | 41.94 |
mambaout + mamba (2-layer encoder) + Box-Decoder | 0.9870 | 0.9531 | 0.9704 | 42.52 | 0.9013
mambaout + Vmamba (2-layer encoder) + Box-Decoder | 0.9875 | 0.9548 | 0.9715 | 47.76 | 0.9064