Commit · 9373f8f
Parent(s): f5ecd98
improve sections

Files changed:
- sections/challenges.md +3 -3
- sections/future_scope.md +1 -2
sections/challenges.md
CHANGED
@@ -1,10 +1,10 @@
  ## Challenges and Technical Difficulties
  
  Training a multilingual image-captioning model was a difficult task, and we faced challenges at almost every point of the process.
  
- - Dataset: Our initial plan was to translate Conceptual 12M using mTranslate or Yandex, but they turned out to be too slow even with multiprocessing, and poor translations would have hurt the performance of the trained image-captioning model.
+ - Dataset: Our initial plan was to translate Conceptual 12M using mTranslate or Yandex, but they turned out to be too slow even with multiprocessing, and poor translations would have hurt the performance of the trained image-captioning model. We therefore translated the whole dataset into all target languages with mBART50, which took around 3-4 days. We then realised that the mBART50 captions were not good enough and the model was not converging because of them, so we re-translated the captions with [Marian](https://huggingface.co/transformers/model_doc/marian.html).
  
  - We prepared the model and config classes for our model from scratch, basing them on the FLAX implementations of `CLIP` (with the ViT-B/32 image transformer) and `mBART50`. Feeding the CLIP embeddings into the mBART50 embeddings class was the major challenge here.
  
- - RAM issues: Loading and training on the 10M image-caption dataset consumed a huge amount of RAM on the TPU (~200 GB within the first few steps), so we had to optimize the script, use less data, and lower `num_workers`.
+ - RAM issues: Loading and training on the 10M image-caption dataset consumed a huge amount of RAM on the TPU (~200 GB within the first few steps), so we had to optimize the script, use less data, and lower `num_workers`. This also slowed down our training.
  
- - We were only able to get around 2-3 days of training time on TPUs due to the aforementioned challenges, so we were unable to perform any hyperparameter tuning.
+ - We were only able to get around 2-3 days of training time on TPUs due to the aforementioned challenges, so we were unable to perform any hyperparameter tuning.
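The Marian re-translation described above can be sketched roughly as follows. This is a minimal sketch, not the repo's actual script: it assumes the `transformers` library, a language-specific Opus-MT checkpoint such as `Helsinki-NLP/opus-mt-en-hi`, and helper names (`batched`, `translate_captions`) that are ours.

```python
def batched(items, size):
    """Yield successive fixed-size chunks so we never hold all outputs for one call."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def translate_captions(captions, model_name="Helsinki-NLP/opus-mt-en-hi", batch_size=32):
    """Translate English captions with a language-specific Marian model (sketch)."""
    # Imported lazily so the batching helper stays usable without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    translated = []
    for batch in batched(captions, batch_size):
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        outputs = model.generate(**inputs)
        translated.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return translated
```

Unlike mBART50, Marian needs one checkpoint per language pair, which is part of why its per-language translations can be stronger.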
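The RAM issue above is the usual cost of materialising all 10M image-caption pairs up front; streaming the metadata and grouping it into batches on the fly keeps resident memory proportional to the batch size. A minimal sketch, assuming a TSV index of `image_path<TAB>caption` rows (the file layout and function names are illustrative, not the repo's actual loader):

```python
import csv

def iter_captions(tsv_path):
    """Stream (image_path, caption) rows one at a time instead of list()-ing 10M rows."""
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            yield row[0], row[1]

def iter_batches(rows, batch_size):
    """Group a row stream into training batches; only one batch is in memory at a time."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch
```

Image decoding would then happen per batch inside the training loop, which is also where lowering `num_workers` trades throughput for memory.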
sections/future_scope.md
CHANGED
@@ -1,5 +1,4 @@
  ## Future scope of work
  
  We hope to improve this project in the future by:
  
- - Better
- - More training time: We found that training the image-captioning model for one epoch takes a lot of compute time, and replicating that for the same number of samples multiplies the training time manyfold.
+ - Better options for data translation: Translation has a huge impact on how the end model performs. Better translators (e.g. the Google Translate API) and language-specific seq2seq translation models can generate better data for both high-resource and low-resource languages.
  
  - Accessibility: Make the model deployable on hand-held devices to make it more accessible. Currently, our model is too large to fit on mobile/edge devices, which keeps many people from accessing it, but our final goal is to ensure everyone can use it without any computation barriers. We learned that JAX has an experimental converter, `jax2tf`, for converting JAX functions to TF, so hopefully we will be able to support TFLite for our model in the future as well.