Update usage with infinity

#15

Ready for review.

docker run --gpus all -p "7997":"7997" michaelf34/infinity:0.0.70 v2 --model-id TencentBAC/Conan-embedding-v1 --dtype float16 --batch-size 32 --engine torch --port 7997
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO     2024-11-26 17:54:25,299 infinity_emb INFO:        infinity_server.py:92
         Creating 1engines:                                                     
         engines=['TencentBAC/Conan-embedding-v1']                              
INFO     2024-11-26 17:54:25,302 infinity_emb INFO: Anonymized   telemetry.py:30
         telemetry can be disabled via environment variable                     
         `DO_NOT_TRACK=1`.                                                      
INFO     2024-11-26 17:54:25,308 infinity_emb INFO:           select_model.py:64
         model=`TencentBAC/Conan-embedding-v1` selected,                        
         using engine=`torch` and device=`None`                                 
INFO     2024-11-26 17:54:25,644                      SentenceTransformer.py:216
         sentence_transformers.SentenceTransformer                              
         INFO: Load pretrained SentenceTransformer:                             
         TencentBAC/Conan-embedding-v1                                          
INFO     2024-11-26 17:54:47,041 infinity_emb INFO: Adding    acceleration.py:56
         optimizations via Huggingface optimum.                                 
The class `optimum.bettertransformers.transformation.BetterTransformer` is deprecated and will be removed in a future release.
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
/app/.venv/lib/python3.10/site-packages/optimum/bettertransformer/models/encoder_models.py:301: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at ../aten/src/ATen/NestedTensorImpl.cpp:178.)
  hidden_states = torch._nested_tensor_from_mask(hidden_states, ~attention_mask)
INFO     2024-11-26 17:54:47,509 infinity_emb INFO: Getting   select_model.py:97
         timings for batch_size=32 and avg tokens per                           
         sentence=2                                                             
                 2.15     ms tokenization                                       
                 17.79    ms inference                                          
                 0.38     ms post-processing                                    
                 20.32    ms total                                              
         embeddings/sec: 1575.01                                                
INFO     2024-11-26 17:54:48,376 infinity_emb INFO: Getting  select_model.py:103
         timings for batch_size=32 and avg tokens per                           
         sentence=512                                                           
                 14.81    ms tokenization                                       
                 398.44   ms inference                                          
                 0.52     ms post-processing                                    
                 413.77   ms total                                              
         embeddings/sec: 77.34                                                  
INFO     2024-11-26 17:54:48,381 infinity_emb INFO: model    select_model.py:104
         warmed up, between 77.34-1575.01 embeddings/sec at                     
         batch_size=32                                                          
INFO     2024-11-26 17:54:48,383 infinity_emb INFO:         batch_handler.py:443
         creating batching engine                                               
INFO     2024-11-26 17:54:48,385 infinity_emb INFO: ready   batch_handler.py:512
         to batch requests.                                                     
INFO     2024-11-26 17:54:48,388 infinity_emb INFO:       infinity_server.py:106
                                                                                
         ♾️  Infinity - Embedding Inference Server                               
         MIT License; Copyright (c) 2023-now Michael Feil                       
         Version 0.0.70                                                         
                                                                                
         Open the Docs via Swagger UI:                                          
         http://0.0.0.0:7997/docs                                               
                                                                                
         Access all deployed models via 'GET':                                  
         curl http://0.0.0.0:7997/models                                        
                                                                                
         Visit the docs for more information:                                   
         https://michaelfeil.github.io/infinity                                 
                                                                                
                                                                                
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:7997 (Press CTRL+C to quit)
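
Once the server reports that startup is complete, it can be queried over HTTP. Below is a minimal sketch of a client, not something taken from this PR. It assumes the container exposes an OpenAI-style `POST /embeddings` route at the root of port 7997 and that the response has the shape `{"data": [{"embedding": [...]}, ...]}`; check the Swagger UI at `/docs` to confirm both for your version. The helper names `build_embedding_request` and `embed` are hypothetical.

```python
import json
import urllib.request

# Host and port taken from the docker command above; adjust if mapped differently.
BASE_URL = "http://localhost:7997"
MODEL_ID = "TencentBAC/Conan-embedding-v1"


def build_embedding_request(texts, model=MODEL_ID, base_url=BASE_URL):
    """Build a POST request for the (assumed) OpenAI-style /embeddings route."""
    payload = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts, timeout=30):
    """Send the request and return one vector per input text.

    Assumes the response body looks like {"data": [{"embedding": [...]}, ...]}.
    """
    request = build_embedding_request(texts)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return [item["embedding"] for item in body["data"]]


if __name__ == "__main__":
    # Requires the container from the command above to be running.
    vectors = embed(["hello", "world"])
    print(len(vectors), len(vectors[0]))
```

Only the standard library is used, so the snippet runs without extra dependencies. Building the request is kept separate from sending it so the payload can be inspected without a running server.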