Suggestions for faster inference.

#6
by SamAct - opened

Recently I have struggled running these models locally, with some runs taking over an hour to churn out a set of paragraphs.

  1. Any suggestions on how to improve my run time?
  2. Do we have any outlook on how to set up ONNX conversions?
  3. What else can be done?

Hey! Thanks for reaching out. I’ll do my best to answer with reasonable formatting (on mobile at the moment):

  1. What does your setup look like? More specifically, are you running this on a GPU? I realized that it’s not on the model card, but inference on CPU with summarization models (especially if you are trying to summarize 16384 tokens at once) takes forever due to the length of the inputs even with methods to make this more efficient.
  • there is an example on pszemraj/led-large-book-summary now of how to do this; I'll add it to this card later 👍
  • for LED models, the way text is encoded and decoded really matters. Check out this notebook for a full example of what that means and how it works.
  2. unsure on ONNX, but I'll look into it over the coming weeks
  3. I would first make sure things are set up correctly on the GPU side; try the notebook above. Then, I would try adjusting the parameters. With the "token batching" approach in the notebook, it will iterate through a file 4096 or 8192 tokens at a time.
  • also, try preprocessing/simple text cleaning on the inputs. I have not figured out why yet, but the compute workload can vary drastically for the same amount of text depending on its composition, even with no obvious difference (it's not like one text was written by a baby or something).
  • if you are on GPU and still having issues even after adjusting params, it's possible we could try the new 8-bit inference (rough sketch after this list). Let me know if you still have issues.
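
As a rough idea, loading in 8-bit would look something like the sketch below (this assumes bitsandbytes and accelerate are installed; exact kwargs depend on your transformers version, and the checkpoint name is just an example):

import torch
from transformers import AutoTokenizer, LEDForConditionalGeneration

# example checkpoint; swap in whichever model you are actually running
model_name = "pszemraj/led-large-book-summary"

# int8 weights via bitsandbytes; device_map places the layers on the available GPU(s)
model = LEDForConditionalGeneration.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

inputs = tokenizer("a long document ...", return_tensors="pt", truncation=True, max_length=16384)
input_ids = inputs.input_ids.to(model.device)
attention_mask = inputs.attention_mask.to(model.device)
global_attention_mask = torch.zeros_like(attention_mask)
global_attention_mask[:, 0] = 1  # global attention on the first token, as in the LED examples

summary_ids = model.generate(input_ids, attention_mask=attention_mask,
                             global_attention_mask=global_attention_mask,
                             max_length=512, num_beams=4)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])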

Happy to answer other questions too, just let me know 👍👍

Interesting. Thank you for this brilliant piece of advice. Based on your inputs I consolidated the notebook as below. I am trying to use this notebook to work on fine-tuning the model. Some of the values for the extravagant runs are also borrowed from your model cards. I will try to play around and see what works best.

I am on a mid-tier computer (mid-range RTX and an old CPU), but the code below ran within 30 seconds to a minute most of the time. Of course, I need to validate the outputs against those of the model card.
I am planning to do an in-depth validation with multiple parameters next.

Note: I have found that skipping the max_length input, especially when the input length is less than 1024 tokens, drastically increases the run time.
In the meantime, could you explain point 3, token batching, in the list above? Thank you again for your answer.

import torch
from transformers import LEDForConditionalGeneration

def alt_led_on_cuda(newchunk, model_path, max_length=512, extravagant=False):
    # load the model in half precision on the GPU
    _model = LEDForConditionalGeneration.from_pretrained(
        model_path, low_cpu_mem_usage=True, torch_dtype="auto"
    ).to("cuda").half()
    # define_tokenizer is a helper from the notebook that loads the tokenizer for model_path
    _tokenizer = define_tokenizer(model=0, hf_name=model_path)
    inputs_dict = _tokenizer(newchunk, padding="max_length", max_length=16384,
                             return_tensors="pt", truncation=True)
    input_ids = inputs_dict.input_ids.to("cuda")
    attention_mask = inputs_dict.attention_mask.to("cuda")
    # LED: global attention on the first token only
    global_attention_mask = torch.zeros_like(attention_mask)
    global_attention_mask[:, 0] = 1
    if extravagant:
        predicted_abstract_ids = _model.generate(
            input_ids, attention_mask=attention_mask,
            global_attention_mask=global_attention_mask,
            max_length=max_length, num_beams=4, do_sample=False,
            no_repeat_ngram_size=3, encoder_no_repeat_ngram_size=3,
            repetition_penalty=3.7, early_stopping=True,
        )
    else:
        predicted_abstract_ids = _model.generate(
            input_ids, attention_mask=attention_mask,
            global_attention_mask=global_attention_mask,
            max_length=max_length, num_beams=4,
        )
    result = _tokenizer.batch_decode(predicted_abstract_ids, skip_special_tokens=True)
    return result
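
A hypothetical call (the checkpoint name here just stands in for whichever model is being tested) would look like:

# hypothetical usage; swap in the checkpoint you actually want to test
summaries = alt_led_on_cuda("some long chapter of text ...",
                            model_path="pszemraj/led-base-book-summary",
                            max_length=512, extravagant=True)
print(summaries[0])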

Nice! That notebook is closely aligned with what I used to fine-tune as well. It should work fine. Also, interesting finding on the inference time! Good to know. Most of my usage of the model is with 4096-8192 tokens of input at a time, so I haven't explored that domain much.

For token batching, I forgot to link you to the notebook I use for summarizing more text than the model can handle at once. I went back and cleaned it up a bit and now also put it on this model's card; it's here. It should illustrate the concept pretty well, but by "token batching" I mean the process of:

  1. tokenize the entire body of text into batches of token_batch_length tokens each, overlapping (repeating the same tokens) by batch_stride tokens; see the sketch after this list
  • I usually set batch_stride to 20 or so, about a sentence
  2. run the summarization model on all batches
  3. check the model output probability scores / read through to make sure things make sense
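
A minimal sketch of that process (token_batch_length and batch_stride follow the description above; the notebook's actual implementation may differ, and the checkpoint name is just an example):

# Rough sketch of "token batching": split a long document into overlapping
# windows of tokens, summarize each window, then join the partial summaries.
import torch
from transformers import AutoTokenizer, LEDForConditionalGeneration

model_name = "pszemraj/led-base-book-summary"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = LEDForConditionalGeneration.from_pretrained(model_name)

def chunk_tokens(text, token_batch_length=4096, batch_stride=20):
    ids = tokenizer(text, truncation=False, return_tensors="pt").input_ids[0]
    step = token_batch_length - batch_stride
    # each window repeats the last batch_stride tokens of the previous one
    return [ids[i : i + token_batch_length] for i in range(0, len(ids), step)]

def summarize_long(text, max_length=256):
    summaries = []
    for batch_ids in chunk_tokens(text):
        input_ids = batch_ids.unsqueeze(0)
        attention_mask = torch.ones_like(input_ids)
        global_attention_mask = torch.zeros_like(input_ids)
        global_attention_mask[:, 0] = 1  # global attention on the first token
        out = model.generate(input_ids, attention_mask=attention_mask,
                             global_attention_mask=global_attention_mask,
                             max_length=max_length, num_beams=4, no_repeat_ngram_size=3)
        summaries.append(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
    return "\n".join(summaries)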

I tried the led-base on a text that was 5000 words long: I first chunked the text into batches of 1000 words, then summarized the list of batches. It took quite a while on CPU; any idea how I can speed it up by varying some of the args? Thanks

Hey! I will answer you on both threads - sorry for the delay. In general, here are some things that can improve runtime or make it more consistent:

  • try chunking your text in X tokens as opposed to words. Sometimes numbers and other digits can throw off the counts, and the reality is that the tokens are what matters (i.e., if you have a lot of words that map to 4+ tokens or so, that batch might take forever). Example code here.
  • decrease num_beams to 1 for greedy search decoding
  • then, you can remove the penalties: set length_penalty=1 and/or repetition_penalty=1. While you can get rid of them, I think some form of preventing repetition is likely needed, so I would try keeping no_repeat_ngram_size=3, etc. (see the sketch after this list)
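
Roughly, those faster settings would translate to something like this sketch (the checkpoint and input text are placeholders, not the notebook's exact configuration):

# Sketch of the suggestions above: greedy search, no length/repetition penalties,
# but keep no_repeat_ngram_size to avoid the model looping.
import torch
from transformers import AutoTokenizer, LEDForConditionalGeneration

model_name = "pszemraj/led-base-book-summary"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = LEDForConditionalGeneration.from_pretrained(model_name)

inputs = tokenizer("a ~4096-token chunk of your document ...", return_tensors="pt",
                   truncation=True, max_length=4096)
global_attention_mask = torch.zeros_like(inputs.input_ids)
global_attention_mask[:, 0] = 1  # global attention on the first token

summary_ids = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    global_attention_mask=global_attention_mask,
    max_length=256,
    num_beams=1,              # greedy decoding instead of beam search
    length_penalty=1.0,       # effectively no length penalty
    repetition_penalty=1.0,   # effectively no repetition penalty
    no_repeat_ngram_size=3,   # keep some protection against repetition
)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])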

Try those, but I think the long-token models are just compute-intensive. I think Spaces used to have more resources (just a feeling I get from compute times now), but if it can't run on CPU on Spaces, it's probably not viable without a GPU. You could also try the longt5-base model on my profile and see if that is more efficient.

I'm going to close this for now, but if there are any issues related to the parameters, feel free to open it again!

pszemraj changed discussion status to closed
