Inference Endpoints Changelog 🚀

Community Article Published October 11, 2024

Week 46, Nov 11 - Nov 17

No changes this week as the team was on an off-site in Martinique! But a lot of ideas and energy cooked up for the coming week 🙌

image/jpeg

Week 45, Nov 04 - Nov 10

This week, we have some awesome updates that are finally out 🙌

  • Scaling replicas based on pending requests is now in beta 🔥 Since it's in beta, things might change, but you can try it out and read more about it here image/png
  • Improved analytics with a graph of the replica history image/png
  • Updates to the widgets
    • Fixed bug in streaming
    • Conversations can now be cleared
    • Submit message with cmd+enter

Week 44, Oct 28 - Nov 03

Probably the biggest update this week was a revamp to the Inference Catalogue 🔥 You can now with a one-click-deploy find a model based on:

  • license
  • price range
  • inference server
  • accelerator
  • and the previously existing task and search filters image/png

Additionally:

  • we fixed the config for MoritzLaurer/deberta-v3-large-zeroshot-v2.0 so that you can run it on CPU as well
  • and also thanks to @ngxson for fixing a bug in the llama.cpp snippet

Week 43, Oct 21-27

This week you'll get a sneak peak of the upcoming autoscaling, in the form of analytics 👀

We have:

  • Added pending http requests to the analytics
  • Support for Image-Text-To-Text, aka language vision models 🔥 (llama vision has some good jokes 😅) image/png
  • Improved the log pagination and added some nice visual touches
  • Fixed a bug related to total request count in the analytics

Week 42, Oct 14-20

This week was unfortunately slower on the user-facing updates.

Behind the scenes, we:

  • fixed several recommendation values for LLaMA and Qwen 2,
  • improved our internal analytics,
  • debugged issues related to weights downloading and getting 429s,
  • and hopefully squashed the last bugs so we can soon release the new autoscaling 🔥

Week 41, Oct 7-13

This week we had a lot of nice UI/UX improvements:

  • clearer error on models that are too large for any instance type, like for llama 405B 😅 image/png

  • better logs loading message if the endpoint isn't ready image/png

Additionally:

  • deprecated the "text2text-generation" tasks, it's been deprecated on the Hub and in the Inference API as well
  • you can now pass the "seed" parameter in the widget for diffuser models
  • small bug fixes on llama.cpp containers
  • you can directly play in the widget with openAI API parameters
  • Shoutout to Alvaro for making the NVLM-D-72B model compatible on endpoints 🙌

On the backend we're also making improvements to the autoscaling. This might not immediately have noticeable impact for user but soon it'll ripple to the front end as well. Stay tuned 👀