Advantages and use cases of Impala AI functions

You can use Impala's ai_generate_text function to access Large Language Models (LLMs) in SQL queries. This function enables you to input a prompt, retrieve the LLM response, and include it in results. You can create custom UDFs for complex tasks like sentiment analysis and translation.

Use LLMs directly in SQL with Impala's ai_generate_text function

Impala introduces a built-in AI function called ai_generate_text that enables direct access to and utilization of Large Language Models (LLMs) in SQL queries. With this function, you can input a prompt, which may include data. The function communicates with a supported LLM endpoint, sends the prompt, retrieves the response, and includes it in the query result.

Alternatively, seamlessly integrate LLM intelligence into your Impala workflow by creating custom User Defined Functions (UDFs) on top of ai_generate_text. This allows you to use concise SQL statements for sending prompts to an LLM and receiving responses. You can define UDFs for complex tasks like sentiment analysis, language translation, and generative contextual analysis.

Advantages of using AI functions

  • Simplified Workflow: Eliminates the necessity for setting up intricate data pipelines.
  • No ML Expertise Required: No specialized machine learning skills are needed.
  • Swift Decision-Making: Enables faster insights on the data, facilitating critical business decisions by using in-database function calls.
  • Integrated Functionality: Requires no external applications, as it is a built-in feature in Data Warehouse.

List of possible use cases

Here are some practical applications of using AI models with the function:
  • Sentiment Analysis: Use the AI model to examine customer reviews for a product and identify their sentiment as positive, negative, or neutral.
  • Language Translation: Translate product reviews written in different languages to understand customer feedback from various regions.
  • Generative Contextual Analysis: Generate detailed reports and insights on various topics based on provided data.

Syntax for AI built-in function arguments

The following example of a built-in AI function demonstrates the use of the OpenAI API as a large language model. Currently, OpenAI's public endpoint and Azure OpenAI endpoints are supported.

AI_GENERATE_TEXT_DEFAULT
Syntax:
ai_generate_text_default(prompt)
AI_GENERATE_TEXT
Syntax:
ai_generate_text(ai_endpoint, prompt, ai_model, ai_api_key_jceks_secret, additional_params)
          

The ai_generate_text function uses the values you pass for ai_endpoint, ai_model, and ai_api_key_jceks_secret. If any of these arguments is empty or NULL, the function falls back to the default defined at the instance level, that is, the corresponding flag configured on the Impala instance. For example, if the ai_endpoint argument is NULL or empty, the function uses the value specified by the ai_endpoint flag.
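The argument-to-flag fallback can be sketched as follows (illustrative Python, not Impala internals; the endpoint and model strings are example values):

```python
def resolve_param(argument, flag_default):
    """Return the per-call argument, or fall back to the instance-level
    flag value when the argument is NULL (None here) or empty --
    mirroring ai_generate_text's described fallback behavior."""
    return argument if argument else flag_default

# With no per-call endpoint, the instance-level ai_endpoint flag wins:
endpoint = resolve_param(None, "https://api.openai.com/v1/chat/completions")
# An explicit argument takes precedence over the flag:
model = resolve_param("gpt-4", "gpt-3.5-turbo")
```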

When using the ai_generate_text_default function, make sure to set all parameters (ai_endpoint, ai_model, and ai_api_key_jceks_secret) in the coordinator/executor flagfiles with appropriate values.
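For example, the instance-level defaults might look like this in the coordinator/executor flagfile (the endpoint, model, and secret name shown are illustrative values, not required settings):

```
ai_endpoint=https://api.openai.com/v1/chat/completions
ai_model=gpt-3.5-turbo
ai_api_key_jceks_secret=open-ai-key
```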

Key parameters for using the AI model

  • ai_endpoint: The endpoint of the model API being interfaced with. Supported services include OpenAI and Azure OpenAI Service, for example, https://api.openai.com/v1/chat/completions.
  • prompt: The text you submit to the AI model to generate a response.
  • ai_model: The specific model name you want to use within the desired API, for example, gpt-3.5-turbo.
  • ai_api_key_jceks_secret: The key name for the JCEKS secret that contains your API key for the AI API you are using. You need a JCEKS keystore containing the secret referenced in ai_api_key_jceks_secret. To make the keystore available, set the hadoop.security.credential.provider.path property in the core-site configuration for both the executor and the coordinator.
  • additional_params: Additional parameters that the AI API offers, provided to the built-in function as a JSON object.
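As a sketch, the JCEKS secret referenced above can be created with the hadoop credential CLI and wired in through core-site (the keystore path and secret name here are illustrative assumptions):

```shell
# Store the API key under the secret name passed as ai_api_key_jceks_secret
# (the keystore path is an example; you are prompted for the key value).
hadoop credential create open-ai-key -provider jceks://hdfs/user/impala/openai.jceks

# core-site.xml on both the coordinator and executor must then point
# at that keystore:
#   <property>
#     <name>hadoop.security.credential.provider.path</name>
#     <value>jceks://hdfs/user/impala/openai.jceks</value>
#   </property>
```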

Examples of using the built-in AI function

The following example shows the simplest possible use of the built-in ai_generate_text_default function: sending a plain prompt and receiving the LLM response directly in the query result.
> select ai_generate_text_default('hello');
  Response:
  Hello! How can I assist you today?
        
In the following example, a query is run against the Amazon book reviews database for the book titled Artificial Superintelligence. The large language model (LLM) is prompted to classify the sentiment of each review as positive, neutral, or negative.
> select customer_id, star_rating, ai_generate_text_default(CONCAT('Classify the following review as positive, neutral, or negative and only include the uncapitalized category in the response: ', review_body)) AS review_analysis, review_body from amazon_book_reviews where product_title='Artificial Superintelligence' order by customer_id LIMIT 1;
  Response:
  +--+------------+------------+----------------+------------------+
  |  |customer_id |star_rating |review_analysis |review_body       |
  +--+------------+------------+----------------+------------------+
  |1 |4343565     | 5          |positive        |What is this book |
  |  |            |            |                |all about …………    |
  +--+------------+------------+----------------+------------------+
        

Examples of creating and using custom UDFs along with the built-in AI function

Instead of writing prompts inline in a SQL query, you can build a UDF that constructs your intended prompt. Once built, the custom UDF assembles the prompt from its input and passes it to the built-in Impala AI function.

Example: Classify input customer reviews

The following UDF uses the Amazon book reviews database as the input and requests the LLM to classify the sentiment.

Classify input customer reviews:


    // Requires the Impala UDF development headers.
    #include <string>
    #include "impala_udf/udf.h"

    using namespace impala_udf;

    IMPALA_UDF_EXPORT
    StringVal ClassifyReviews(FunctionContext* context, const StringVal& input) {
      // Build the classification prompt around the incoming review text.
      std::string request =
        std::string("Classify the following review as positive, neutral, or negative")
        + std::string(" and only include the uncapitalized category in the response: ")
        + std::string(reinterpret_cast<const char*>(input.ptr), input.len);
      StringVal prompt(request.c_str());
      // Endpoint, model, and JCEKS secret name for the OpenAI API key.
      const StringVal endpoint("https://api.openai.com/v1/chat/completions");
      const StringVal model("gpt-3.5-turbo");
      const StringVal api_key_jceks_secret("open-ai-key");
      // additional_params is a JSON object; the "model" entry here overrides
      // the ai_model argument above.
      const StringVal params("{\"temperature\": 0.9, \"model\": \"gpt-4\"}");
      // Delegate to the built-in ai_generate_text function.
      return context->Functions()->ai_generate_text(
        context, endpoint, prompt, model, api_key_jceks_secret, params);
    }
      

You can now compile these prompt-building UDFs and register them in Impala. Once registered, you can use them to query your datasets.

Creating analyze_reviews function:

> CREATE FUNCTION analyze_reviews(STRING)
    RETURNS STRING
    LOCATION 's3a://dw-...............'
    SYMBOL='ClassifyReviews'
      

Using SELECT query for Sentiment analysis to classify Amazon book reviews

> SELECT customer_id, star_rating, analyze_reviews(review_body) AS review_analysis, review_body from amazon_book_reviews where product_title='Artificial Superintelligence' order by customer_id;
  Response:
  +--+------------+------------+----------------+----------------------+
  |  |customer_id |star_rating |review_analysis |review_body           |
  +--+------------+------------+----------------+----------------------+
  |1 |44254093    | 5          |positive        |What is this book all |
  |  |            |            |                |about? It is all about|
  |  |            |            |                |a mind-blowing        |
  |  |            |            |                |universal law of      |
  |  |            |            |                |nature. Mind-blow…    |
  +--+------------+------------+----------------+----------------------+
  |2 |50050072    | 5          |positive        |The two tightly-      |
  |  |            |            |                |connected ideas strike|
  |  |            |            |                |you as amazed. In the |
  |  |            |            |                |first place, what has |
  |  |            |            |                |never bef…            |
  +--+------------+------------+----------------+----------------------+
  |3 |50050072    | 5          |positive        |The two tightly-      |
  |  |            |            |                |connected ideas strike|
  |  |            |            |                |you as amazed. In the |
  |  |            |            |                |first place, what has |
  |  |            |            |                |never bef…            |
  +--+------------+------------+----------------+----------------------+
  |4 |52932308    | 1          |negative        |This book is seriously|
  |  |            |            |                |flawed. I could not   |
  |  |            |            |                |work out if the author|
  |  |            |            |                |was a mathemetician   |
  |  |            |            |                |dabbi…                |
  +--+------------+------------+----------------+----------------------+
  |5 |52971961    | 1          |negative        |Abdoullaev's          |
  |  |            |            |                |exploration of        |
  |  |            |            |                |Al issues appears to  |
  |  |            |            |                |be very technological |
  |  |            |            |                |and straightforward…  |
  +--+------------+------------+----------------+----------------------+
  |6 |53008416    | 4          |positive        |As Co Founder of      |
  |  |            |            |                |ArtilectWorld:ultra   |
  |  |            |            |                |intelligent machine,  |
  |  |            |            |                |I recommend reading   |
  |  |            |            |                |this book!            |
  +--+------------+------------+----------------+----------------------+