# Visual semantic with BERT-CNN

This model assigns an object-to-caption semantic relatedness score, which is useful for
(1) diverse caption re-ranking, and (2) generating soft labels for caption filtering when scraping image-caption pairs from the web.

To take advantage of the overlap between the visual context and the caption, and to extract global information from each visual cue (e.g., object, scene, etc.), we use BERT as an embedding layer followed by a shallow CNN with a tri-gram kernel (Kim, 2014).
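As a rough illustration of the tri-gram CNN with max-over-time pooling (Kim, 2014) applied on top of contextual embeddings, here is a minimal NumPy sketch. This is not the repository's actual implementation; the filter count, ReLU activation, and pooling details below are assumptions for illustration only.

```python
import numpy as np

def trigram_cnn_features(token_embeddings, kernels):
    """One max-pooled feature per filter, Kim (2014) style.

    token_embeddings: (seq_len, dim) contextual token embeddings (e.g., from BERT)
    kernels: (n_filters, 3, dim) tri-gram convolution filters
    """
    seq_len, _ = token_embeddings.shape
    n_filters = kernels.shape[0]
    feats = np.full(n_filters, -np.inf)
    for i in range(seq_len - 2):
        window = token_embeddings[i:i + 3]  # tri-gram window of tokens
        # Convolve every filter with the window, then apply ReLU.
        acts = np.maximum(np.tensordot(kernels, window, axes=([1, 2], [0, 1])), 0)
        feats = np.maximum(feats, acts)     # max-over-time pooling
    return feats
```

The pooled feature vector would then feed a classifier head that outputs the relatedness score.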
 

 For datasets with fewer than 100K examples, please have a look at our [shallow model](https://github.com/ahmedssabir/Semantic-Relatedness-Based-Reranker-for-Text-Spotting).




The model is trained with a strict filter: a 0.4 similarity-distance threshold between the object and its related caption.
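The embedding and similarity measure behind that filter are not shown here, so the following is only a minimal sketch of such a threshold filter, assuming cosine similarity over some embedding (`embed` is a hypothetical stand-in) and treating 0.4 as a minimum similarity rather than a maximum distance:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_pairs(pairs, embed, threshold=0.4):
    """Keep only (visual, caption) pairs whose similarity meets the threshold."""
    return [(v, c) for v, c in pairs
            if cosine_similarity(embed(v), embed(c)) >= threshold]
```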

 For a quick start, please have a look at this [colab](https://colab.research.google.com/drive/1N0JVa6y8FKGLLSpiG7hd_W75UYhHRe2j?usp=sharing)
 
 The dataset is available on [Hugging Face](https://huggingface.co/datasets/AhmedSSabir/Textual-Image-Caption-Dataset)
 

```shell
conda create -n BERT_visual python=3.6 anaconda
conda activate BERT_visual
pip install tensorflow==1.15.0
pip install --upgrade tensorflow_hub==0.7.0
```

```shell
git clone https://github.com/gaphex/bert_experimental/
```
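The inference script below reads `test.tsv`, a tab-separated file with `visual` and `caption` columns (one pair per row). A minimal sketch of producing such a file, with made-up rows:

```python
import csv

# Hypothetical rows: a visual context label paired with a candidate caption.
rows = [
    ("dog", "a dog is playing with a ball"),
    ("snowboard", "a man riding a snowboard down a slope"),
]

with open("test.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["visual", "caption"])
    writer.writerows(rows)
```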


```python
import sys

import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split

sys.path.insert(0, "bert_experimental")

from bert_experimental.finetuning.text_preprocessing import build_preprocessor
from bert_experimental.finetuning.graph_ops import load_graph

# Each row pairs a visual context (object/scene label) with a caption.
df = pd.read_csv("test.tsv", sep='\t')

texts = []
delimiter = " ||| "  # separates the visual context from the caption

for vis, cap in zip(df.visual.tolist(), df.caption.tolist()):
    texts.append(delimiter.join((str(vis), str(cap))))

texts = np.array(texts)

trX, tsX = train_test_split(texts, shuffle=False, test_size=0.01)

# Restore the exported model and locate its input/output tensors.
restored_graph = load_graph("frozen_graph.pb")

graph_ops = restored_graph.get_operations()
input_op, output_op = graph_ops[0].name, graph_ops[-1].name
print(input_op, output_op)

x = restored_graph.get_tensor_by_name(input_op + ':0')
y = restored_graph.get_tensor_by_name(output_op + ':0')

# Build the BERT preprocessing op (WordPiece vocab, max sequence length 64).
preprocessor = build_preprocessor("vocab.txt", 64)
py_func = tf.numpy_function(preprocessor, [x],
                            [tf.int32, tf.int32, tf.int32], name='preprocessor')

# Predictions
sess = tf.Session(graph=restored_graph)

print(trX[:4])

y = tf.print(y, summarize=-1)
y_out = sess.run(y, feed_dict={
    x: trX[:4].reshape((-1, 1))
})

print(y_out)
```
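Once `y_out` holds a relatedness score per (visual, caption) pair, re-ranking a set of candidate captions reduces to sorting them by score. A minimal sketch (the captions and score values here are made up):

```python
import numpy as np

def rerank(captions, scores):
    """Return captions sorted by descending relatedness score."""
    order = np.argsort(np.asarray(scores))[::-1]
    return [captions[i] for i in order]
```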