
What is an AI token?

A presenter at Google I/O shows information on a new AI project. (Image: Google)

Google recently announced that Gemini 1.5 Pro would increase from a 1 million token context window to 2 million. That sounds impressive, but what in the world is a token anyway?

At their core, chatbots need help processing the text they receive so they can understand concepts and communicate with you in a human-like fashion. In the generative AI space, this is accomplished with a token system that breaks data down into pieces that are more easily digestible by AI models.

What is an AI token?

An infographic highlighting Gemini's 1 million token long context window capability. (Image: Google)

An AI token is the smallest unit a word or phrase can be broken down into when being processed by a large language model (LLM). Tokens can represent whole words, subwords, or punctuation marks, which allows models to efficiently analyze and interpret text and, subsequently, generate content in the same unit-based fashion. This is similar to how a computer converts text into binary data for easier processing. Tokens let a model find patterns and relationships within words and phrases so it can predict future terms and respond in the context of your prompt.

When you input a prompt, the words and phrases are not something a chatbot can interpret as is – they must be broken down into smaller pieces before the LLM can even process the request. The text is converted into tokens, the request is then analyzed, and a response is returned to you.
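Under the hood, each token is typically mapped to an integer ID from the model's vocabulary, and the model's output IDs are mapped back into text at the end. Here is a minimal sketch of that round trip in Python, assuming a toy four-word vocabulary and made-up encode/decode helpers; real tokenizers and vocabularies are far larger and learned from data.

```python
# Toy vocabulary for illustration only - real LLMs learn vocabularies of
# tens of thousands of tokens from training data.
vocab = {"It's": 0, "raining": 1, "outside": 2, "today": 3}
id_to_token = {i: t for t, i in vocab.items()}

def encode(text: str) -> list[int]:
    # Break the text into tokens and map each one to its integer ID.
    return [vocab[token] for token in text.split()]

def decode(ids: list[int]) -> str:
    # Map IDs back to tokens to turn the model's output into readable text.
    return " ".join(id_to_token[i] for i in ids)

ids = encode("It's raining outside")
print(ids)                # [0, 1, 2]
print(decode(ids + [3]))  # "It's raining outside today" (pretend the model predicted "today")
```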

The process of turning text into tokens is called tokenization. There are many tokenization methods, which differ in how they split text up – by dictionary entries, word combinations, language, and so on. For example, the space-based tokenization method splits words up based on the spaces between them. The phrase “It’s raining outside” would be split into the tokens ‘It’s’, ‘raining’, and ‘outside’.
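To make the idea concrete, here is a minimal Python sketch of two simple tokenization methods; the function names are made up for illustration, and production LLM tokenizers (such as byte-pair encoding) use more sophisticated, learned splits:

```python
import re

def space_tokenize(text: str) -> list[str]:
    # Space-based tokenization: split only on whitespace.
    return text.split()

def word_punct_tokenize(text: str) -> list[str]:
    # A finer-grained method: peel punctuation off into separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

print(space_tokenize("It's raining outside"))       # ["It's", 'raining', 'outside']
print(word_punct_tokenize("It's raining outside"))  # ['It', "'", 's', 'raining', 'outside']
```

Note how the two methods split the same phrase differently – that variation is exactly what the choice of tokenization method determines.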

How do AI tokens work?

The general conversion followed in the generative AI space is that one token equals approximately four characters of English text, or about 3/4 of a word, and 100 tokens equals approximately 75 words. Other conversions suggest one to two sentences equals about 30 tokens, one paragraph equals about 100 tokens, and 1,500 words equals about 2,048 tokens.
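Those conversions are only rules of thumb, but they make it possible to estimate roughly how many tokens a prompt will consume before it is sent. The snippet below is a back-of-the-envelope sketch using the heuristics above; the function name is invented, and exact counts come from running the provider's actual tokenizer:

```python
def estimate_tokens(text: str) -> int:
    # Rule of thumb: one token is roughly 4 characters of English text.
    by_chars = len(text) / 4
    # Alternative rule of thumb: one token is roughly 3/4 of a word.
    by_words = len(text.split()) / 0.75
    # Average the two heuristics for a rough, order-of-magnitude estimate.
    return round((by_chars + by_words) / 2)

print(estimate_tokens("A 75-word paragraph works out to roughly 100 tokens."))
```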

Whether you’re a general user, a developer, or an enterprise, the AI program you’re using is employing tokens to perform its tasks. Once you begin paying for generative AI services, you’re effectively paying in tokens: usage is typically metered and billed by the number of tokens a model processes.

Most generative AI brands also have basic rules around how tokens function on their AI models. Many companies have token limitations, which put a cap on the number of tokens that can be processed in one turn. If the request is larger than the token limit on an LLM, the tool won’t be able to complete a request in a single turn. For example, if you input a 10,000-word article for translation into a GPT with a 4,096-token limit, it won’t be able to process it fully to give a detailed answer because such a request would require at least 15,000 tokens.
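A common workaround is to split a long document into chunks that each fit under the limit and process them one at a time. The sketch below is one rough way to do this using the word-to-token rule of thumb; the function name, the 4,096-token default, and the choice to reserve about half the window for the model's reply are all assumptions rather than any provider's actual API:

```python
def chunk_by_token_limit(words: list[str], token_limit: int = 4096) -> list[list[str]]:
    # Approximate one token as 3/4 of a word, and reserve roughly half the
    # context window for the model's reply.
    words_per_chunk = int(token_limit * 0.75 * 0.5)
    return [words[i:i + words_per_chunk] for i in range(0, len(words), words_per_chunk)]

article_words = ("word " * 10_000).split()
chunks = chunk_by_token_limit(article_words)
print(len(chunks), "chunks of at most", len(chunks[0]), "words each")  # 7 chunks of 1,536 words
```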

However, companies have quickly been advancing the capabilities of their LLMs, raising the token limits with each new version. Google’s research-based BERT model had a maximum input length of 512 tokens. OpenAI’s GPT-3.5 LLM, which runs the free version of ChatGPT, has a max of 4,096 input tokens, while its GPT-4 LLM, which runs the paid version of ChatGPT, has a max of 32,768 input tokens. That equates to approximately 25,000 words or 50 pages of text.

Google’s Gemini 1.5 Pro, which provides audio functionality to the brand’s AI Studio, has a standard 128,000-token context window. The Claude 2.1 LLM has a limit of up to 200,000 context tokens, which equates to approximately 150,000 words or 500 pages of text.

What are the different types of AI tokens?

There are several types of tokens used in the generative AI space that allow LLMs to identify the smallest units available for analysis. Here are some of the main token types an AI model works with; a toy example after the list shows how a single sentence can surface all of them.

  • Word Tokens are words that represent single units on their own, such as “bird,” “house,” or “television.”
  • Sub-word Tokens are words that can be truncated into smaller units, such as splitting Tuesday into “Tues” and “day.”
  • Punctuation Tokens take the place of punctuation marks, including commas (,), periods (.), and others.
  • Number Tokens take the place of numerical figures, including the number “10.”
  • Special Tokens mark unique instructions within queries and training data, such as the start or end of a sequence.
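As a toy illustration of how these types can show up together, here is one way a short sentence might break down. This is not any real model's tokenization; learned tokenizers pick their own splits, and the special-token names below are invented for the example:

```python
# Hypothetical breakdown of "Meet me on Tuesday at 10." - for illustration only.
tokens = [
    "<start>",           # special token marking the beginning of the sequence
    "Meet", "me", "on",  # word tokens
    "Tues", "day",       # sub-word tokens
    "at",                # word token
    "10",                # number token
    ".",                 # punctuation token
    "<end>",             # special token marking the end of the sequence
]
print(tokens)
```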

What are the benefits of tokens?

There are several benefits to tokens in the generative AI space. Primarily, they act as a connector between human language and computer language when working with LLMs and other AI processes. Tokens help models process large amounts of data at once, which is especially beneficial in enterprise spaces that use LLMs. Companies can work with token limits to optimize the performance of AI models. As future LLM versions are introduced, tokens will allow models to have a larger memory through higher limits or context windows.

Other benefits of tokens lie in how LLMs are trained. Because tokens are small units, they make it easier to optimize the speed of data processing. And because models predict text one token at a time, they can build a stronger grasp of concepts and improve their output sequences over time. Tokens also assist in bringing multimodal content such as images, video, and audio into LLMs alongside text-to-speech chatbots.

Tokens also offer some data security and cost-efficiency benefits: encoding text as token IDs rather than raw data adds a layer of abstraction over vital information, and condensing longer text into a simplified form keeps processing costs down.

Fionna Agomuoh
Fionna Agomuoh is a Computing Writer at Digital Trends. She covers a range of topics in the computing space, including…