	# A MBARTHEZ MODEL TRAINED FOR QUESTION GENERATION

	## Training

	The model was trained on several French and English corpora (FQuAD, PIAF and SQuAD).

	## Generate
	```python
	from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

	# Load the tokenizer and the model
	access_token = "hf_......"
	tokenizer = AutoTokenizer.from_pretrained("ThomasGerald/MBARTHEZ-QG", use_auth_token=access_token)
	model = AutoModelForSeq2SeqLM.from_pretrained("ThomasGerald/MBARTHEZ-QG", use_auth_token=access_token)

	# Example input: the <hl> tokens delimit the answer span (the support of the question)
	text = ("La recherche moderne considère généralement que la langue grecque n'est pas née en Grèce, " +
	"mais elle n'est pas arrivée à un consensus quant à la date d'arrivée des groupes parlant un " +
	"« proto-grec », qui s'est produite durant des phases préhistoriques pour lesquelles il n'y a " +
	"pas de texte indiquant quelles langues étaient parlées. Les premiers textes écrits en grec <hl>sont " +
	"les tablettes en linéaire B de l'époque mycénienne<hl>, au XIVe siècle av. J.-C., ce qui indique que " +
	"des personnes parlant un dialecte grec sont présentes en Grèce au plus tard durant cette période." +
	" La linguistique n'est pas en mesure de trancher, pas plus que l'archéologie.")

	tokenized_text = tokenizer([text], return_tensors="pt")

	# Condition the output on the target language (two tokens are possible: '[fr_XX]' and '[en_XX]')
	output_ids = model.generate(**tokenized_text, forced_bos_token_id=tokenizer.convert_tokens_to_ids('[fr_XX]'))

	# Decoding
	output = tokenizer.batch_decode(output_ids, skip_special_tokens=False)

	# output:
	'''['</s>[fr_XX] Quels sont les premiers textes écrits en grec?</s>']'''

	```
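The `<hl>` markers in the input above were typed by hand. When the answer span is known by character offsets, as in SQuAD-style data, a small helper can insert them. The sketch below is only an illustration: `highlight_answer` and `HL_TOKEN` are hypothetical names and are not part of the model or of `transformers`.

```python
HL_TOKEN = "<hl>"  # highlight marker used in the examples of this README


def highlight_answer(context: str, answer_start: int, answer_end: int) -> str:
    """Wrap context[answer_start:answer_end] in <hl> markers."""
    if not 0 <= answer_start < answer_end <= len(context):
        raise ValueError("the answer span must be a non-empty range inside the context")
    return (
        context[:answer_start]
        + HL_TOKEN
        + context[answer_start:answer_end]
        + HL_TOKEN
        + context[answer_end:]
    )


context = "Thebes defeated Sparta at the Battle of Leuctra in 371 BC."
answer = "the Battle of Leuctra"
start = context.index(answer)
text = highlight_answer(context, start, start + len(answer))
print(text)
# → Thebes defeated Sparta at <hl>the Battle of Leuctra<hl> in 371 BC.
```

The resulting `text` can be passed to the tokenizer in the same way as the hand-written examples.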


	We can also generate questions in English from a French context by specifying the beginning-of-sentence token ('[en_XX]'). Appending the following lines to the previous code generates an English question:
	``` python
	output_ids = model.generate(**tokenized_text, forced_bos_token_id=tokenizer.convert_tokens_to_ids('[en_XX]'))
	output = tokenizer.batch_decode(output_ids, skip_special_tokens=False)

	# output:
	'''['</s>[en_XX] What are the first texts written in grec?</s>']'''
	```
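Because the examples decode with `skip_special_tokens=False`, the returned strings still carry the `</s>` markers and the language tag. One way to recover the bare question is a small post-processing step. This is a minimal sketch: `clean_question` is a hypothetical helper, and the regular expression assumes the tags look like the `[fr_XX]` and `[en_XX]` tokens shown above.

```python
import re

# Sequence markers and mBART-style language codes such as [fr_XX] or [en_XX].
_SPECIAL_TOKENS = re.compile(r"</s>|<s>|<pad>|\[[a-z]{2}_[A-Z]{2}\]")


def clean_question(decoded: str) -> str:
    """Remove sequence markers and the language tag from a decoded output string."""
    return _SPECIAL_TOKENS.sub("", decoded).strip()


print(clean_question("</s>[fr_XX] Quels sont les premiers textes écrits en grec?</s>"))
# → Quels sont les premiers textes écrits en grec?
```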


	Of course, you can also generate questions from English text:
	``` python
	# Example input: the <hl> tokens delimit the answer span (the support of the question)
	text = ("By 371 BC, Thebes was in the ascendancy, defeating Sparta at " +
	"<hl>the Battle of Leuctra<hl>, killing the Spartan king Cleombrotus I" +
	", and invading Laconia. Further Theban successes against Sparta " +
	"in 369 led to Messenia gaining independence; Sparta never recovered " +
	"from the loss of Messenia's fertile land and the helot workforce it " +
	"provided.[50] The rising power of Thebes led Sparta and Athens to join " +
	"forces; in 362 they were defeated by Thebes at the Battle of Mantinea." +
	" In the aftermath of Mantinea, none of the major Greek states were able " +
	"to dominate. Though Thebes had won the battle, their general Epaminondas " +
	"was killed, and they spent the following decades embroiled in wars with " +
	"their neighbours; Athens, meanwhile, saw its second naval alliance," +
	" formed in 377, collapse in the mid-350s.")

	tokenized_text = tokenizer([text], return_tensors="pt")

	# French question
	output_ids = model.generate(**tokenized_text, forced_bos_token_id=tokenizer.convert_tokens_to_ids('[fr_XX]'))

	# Decoding
	output = tokenizer.batch_decode(output_ids, skip_special_tokens=False)

	# Note that the model does not translate "Sparta", which is "Sparte" in French
	'''['</s>[fr_XX] À quelle bataille Sparta a-t-il été vaincu par Thebes?</s>']'''

	# English question
	output_ids = model.generate(**tokenized_text, forced_bos_token_id=tokenizer.convert_tokens_to_ids('[en_XX]'))

	# Decoding
	output = tokenizer.batch_decode(output_ids, skip_special_tokens=False)

	'''['</s>[en_XX] At what battle did Thebes defeat Sparta?</s>']'''

	```
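The examples above repeat the same tokenize, generate and decode steps for each target language. They could be grouped into one function, as in the sketch below. `generate_questions` and `LANGUAGE_TOKENS` are hypothetical names introduced for illustration; the function assumes `model` and `tokenizer` were loaded as in the first example and only uses the calls already shown in this README.

```python
# Language tags accepted by the model, as listed in the examples above.
LANGUAGE_TOKENS = {"fr": "[fr_XX]", "en": "[en_XX]"}


def generate_questions(model, tokenizer, text, languages=("fr", "en")):
    """Return a dict mapping each requested language to one decoded question."""
    # Reject unknown languages before running any generation.
    for lang in languages:
        if lang not in LANGUAGE_TOKENS:
            raise ValueError(f"unsupported language: {lang!r}")

    # The input is tokenized once and reused for every target language.
    inputs = tokenizer([text], return_tensors="pt")
    questions = {}
    for lang in languages:
        bos_id = tokenizer.convert_tokens_to_ids(LANGUAGE_TOKENS[lang])
        output_ids = model.generate(**inputs, forced_bos_token_id=bos_id)
        decoded = tokenizer.batch_decode(output_ids, skip_special_tokens=False)
        questions[lang] = decoded[0]
    return questions
```

With the objects from the first example, `generate_questions(model, tokenizer, text)` would return the French and English questions for the highlighted span in a single call.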