
vLLM

vLLM provides an HTTP server that implements OpenAI’s Completions and Chat API.

Send your first inference request using the API and vLLM. Use the address of your endpoint and append the path of the operation you want to call:

Chat Completions

/v1/chat/completions

Example: https://71bd1b92256e.app.modelserve.ai/v1/chat/completions

Chat Completions are a specialized type of completion designed for conversational interactions: the model responds to messages within the context of a conversation. Chat models are optimized for dialogue and are trained to take the previous messages in the conversation into account.

Here are some key features and applications of Chat Completion models:

  • Customer support: Automatically responding to customer inquiries, resolving issues, and providing information.
  • Virtual assistants: Helping with task management, answering questions, and assisting with daily activities.
  • Education and training: Interactive learning tools that can answer student questions and conduct interactive lessons.
  • Entertainment: Creating game characters that can engage in realistic conversations with players.
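The API itself is stateless: the model only sees the context that the client resends in the messages array on every request. A minimal Python sketch of maintaining that conversation history between requests (no network call is made; the model name matches the examples below, and add_turn is an illustrative helper, not part of the API):

```python
# Sketch: carrying conversation context across turns.
# The server keeps no state between requests, so the client
# appends each completed exchange to the "messages" list it sends.

history = [{"role": "system", "content": "You are a helpful assistant."}]

def add_turn(history, user_text, assistant_text):
    """Record one user/assistant exchange in the running history."""
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": assistant_text})

add_turn(history, "Who won the World Cup in 2022?",
         "Argentina won the 2022 FIFA World Cup.")

# The next request's payload includes the full history plus the new question:
payload = {
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "max_tokens": 300,
    "temperature": 0,
    "messages": history + [{"role": "user", "content": "Who was the captain?"}],
}
```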

An example Chat Completions API call looks like the following:

Payload:

{
  "model": "mistralai/Mistral-7B-Instruct-v0.3",
  "max_tokens": 300,
  "temperature": 0,
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Who won the World Cup in 2022?"
    }
  ]
}

cURL:
curl -s -X POST \
     -H 'Accept: application/json' \
     -H 'Content-Type: application/json' \
     -H 'Authorization: Bearer X' \
     -d '{"model": "mistralai/Mistral-7B-Instruct-v0.3", "max_tokens": 300, "temperature": 0, "messages": [{"role": "system","content": "You are a helpful assistant."},{"role": "user","content": "Who won the World Cup in 2022?"}]}' \
     'https://{address}/v1/chat/completions'

Python:
import requests

r = requests.post(
    "https://{address}/v1/chat/completions",
    headers={
        "Accept": "application/json",
        "Content-Type": "application/json",
        "Authorization": "Bearer X",
    },
    # json= serializes the payload as JSON; data= would form-encode the dict
    json={
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "max_tokens": 300,
        "temperature": 0,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the World Cup in 2022?"},
        ],
    },
)

JavaScript:
fetch('https://{address}/v1/chat/completions', {
  "method": "POST",
  "headers": {
    "Accept": "application/json",
    "Content-Type": "application/json",
    "Authorization": "Bearer X"
  },
  "body": JSON.stringify({"model": "mistralai/Mistral-7B-Instruct-v0.3", "max_tokens": 300, "temperature": 0, "messages": [{"role": "system","content": "You are a helpful assistant."},{"role": "user","content": "Who won the World Cup in 2022?"}]})
});
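A successful call returns JSON in the OpenAI Chat Completions format, with the generated text in choices[0].message.content. A sketch of extracting the reply (the response dict below is hard-coded sample data mimicking the response shape, not output from a live call):

```python
# Sketch: extracting the assistant's reply from a Chat Completions response.
# sample_response mimics the JSON shape the endpoint returns; the values
# (id, token counts, reply text) are made up for illustration.
sample_response = {
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant",
                        "content": "Argentina won the 2022 World Cup."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 25, "completion_tokens": 10, "total_tokens": 35},
}

reply = sample_response["choices"][0]["message"]["content"]
print(reply)
```

With the Python example above, the same extraction would be r.json()["choices"][0]["message"]["content"].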

Remember to replace the "X" in the Authorization header with your real Access Token. You can find your Access Token (Bearer) in the 🚀 Quickstart section.

Learn more about the capabilities of the vLLM API by clicking below. Also, check the Notebooks section, where you will find ready-made examples to try.

Learn more

Embeddings

/v1/embeddings

Example: https://71bd1b92256e.app.modelserve.ai/v1/embeddings

The Embeddings model transforms text into numerical vectors (called embeddings) that represent the semantic meaning of the text.

Here are some key features and applications of Embeddings models:

  • Semantic search: Finding texts that are semantically similar to a given query.
  • Text classification: Assigning categories to text based on its semantics.
  • Sentiment analysis: Assessing the emotions expressed in a text.
  • Text clustering: Grouping texts with similar content.
  • Text comparison: Evaluating the degree of similarity between different text fragments.
  • Recommendations: Suggesting content based on its similarity to other texts.

An example Embeddings API call looks like the following:

Payload:

{
  "input": "Your text string goes here"
}

cURL:
curl -s -X POST \
     -H 'Accept: application/json' \
     -H 'Content-Type: application/json' \
     -H 'Authorization: Bearer X' \
     -d '{"input": "Your text string goes here"}' \
     'https://{address}/v1/embeddings'

Python:
import requests

r = requests.post(
    "https://{address}/v1/embeddings",
    headers={
        "Accept": "application/json",
        "Content-Type": "application/json",
        "Authorization": "Bearer X",
    },
    # json= serializes the payload as JSON; data= would form-encode the dict
    json={"input": "Your text string goes here"},
)

JavaScript:
fetch('https://{address}/v1/embeddings', {
  "method": "POST",
  "headers": {
    "Accept": "application/json",
    "Content-Type": "application/json",
    "Authorization": "Bearer X"
  },
  "body": JSON.stringify({"input": "Your text string goes here"})
});
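The response JSON carries each embedding vector in data[i].embedding; a common next step is comparing vectors with cosine similarity, which is what powers the semantic-search and recommendation use cases above. A self-contained sketch using toy three-dimensional vectors in place of real embeddings (real ones have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" standing in for vectors returned by /v1/embeddings.
query = [0.1, 0.9, 0.2]
doc_similar = [0.2, 0.8, 0.1]
doc_unrelated = [0.9, 0.0, -0.4]

# Texts with similar meaning should score closer to 1.0 than unrelated texts.
assert cosine_similarity(query, doc_similar) > cosine_similarity(query, doc_unrelated)
```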

Remember to replace the "X" in the Authorization header with your real Access Token. You can find your Access Token (Bearer) in the 🚀 Quickstart section.

Learn more about the capabilities of the vLLM API by clicking below. Also, check the Notebooks section, where you will find ready-made examples to try.

Learn more