In the ever-evolving landscape of AI, our team embarked on a journey to enhance our platform by incorporating a practical AI block feature. The possibilities seemed limitless, ranging from facial recognition for attendance monitoring to extracting vital information from images, such as analyzing the quality and detecting diseases in plants through image analysis. After successfully validating these use-cases through a Proof of Concept (POC), we were eager to integrate this AI capability into our platform.
Choosing the right AI API was a critical decision, and after careful consideration, we opted for the OpenAI ChatGPT API. Our exploration led us to this choice, as alternatives like Bard fell short in delivering satisfactory responses, and Gemini had not yet opened up its API to the public. OpenAI's GPT API demonstrated superior performance in handling diverse tasks. With a clear direction, we set our sights on leveraging the power of the OpenAI GPT API to bring our AI block feature to life.
For text analysis, we've adopted the straightforward approach of employing the gpt-3.5-turbo model. Input for text analysis is facilitated through dedicated text blocks, allowing users to provide instructions for the subsequent evaluation. We've strategically implemented constraints, capping the maximum input length at 500 characters, and limiting the output to 100 characters for succinct and manageable results.
For our image analysis feature, we've incorporated the powerful 'gpt-4-vision-preview' model, an extension of GPT-4 Turbo with advanced image comprehension capabilities. Currently, it references 'gpt-4-1106-vision-preview.' Users can seamlessly upload images using our existing file block, which is then directed to S3.
Here's an overview of our image analysis workflow:
Optimizations for Multiple Images:
The cost of image analysis is calculated based on tokens, factoring in the image size and the detail
option. For images with detail: low
, the cost is a fixed 85 tokens. For detail: high
images, they are resized to fit within a 2048 x 2048 square, maintaining aspect ratio, and then scaled down to a minimum of 768px on the shortest side. The total token cost is calculated based on the number of 512px squares required, with each square costing 170 tokens, plus an additional 85 tokens. Examples illustrate the token cost for different image sizes and detail
options, ensuring a transparent and scalable pricing structure.
In the AI block, instructions are mandated to be non-empty, requiring users to provide either static or dependency-based guidance. This ensures purposeful utilization, preventing empty instruction scenarios. For tasks like extracting a restaurant bill amount with GST considerations, users articulate specific instructions such as, "Extract the total bill amount from the provided @restaurantFileBlock." This approach fosters a nuanced and effective use of the AI block, emphasizing clarity and user intentionality, while minimizing the potential for ambiguous or redundant queries.
To manage usage effectively, we've implemented rate limiting for each workplace, currently set at 100 triggers. This cautious approach during the initial stages allows users to explore the feature while maintaining a balanced workload on our AI resources.
To ensure controlled access to the AI feature, only individuals with authorized access to the app are eligible to utilize it. In the context of public sharing, if an app has public sharing enabled and a user lacks the necessary authorization, attempts to use the AI block will result in an error. This restriction applies specifically to users who do not possess the requisite authorization credentials.
Additionally, users are barred from utilizing the AI feature in the context of embedded and publicly shared apps. This stringent approach ensures that the AI capabilities remain within the intended user base and aligns with the access control policies established for the application.
Implementing an efficient caching mechanism has been a pivotal aspect of our system, particularly for instances where the same request with identical field IDs for a form instance is encountered. This strategy significantly optimizes our response time and resource utilization. We've designed the caching system to recognize when a request shares the same field ID for a form instance. This ensures that redundant queries are avoided, optimizing the overall processing flow. This caching also works with image queries.
Within the AI block, intricate linkages may exist, involving multiple lines of text or numerous images. To maintain a streamlined process, the AI block is designed to trigger only when all associated dependencies are filled. If any of the required blocks are left empty, the AI block prompts the user to complete all missing dependencies before proceeding. This strategy not only enhances the effectiveness of the AI feature but also contributes to a more streamlined and cost-conscious user experience.
The detail
parameter in the AI model offers three options: low
, high
, or auto
, providing control over the image processing and textual generation. The default setting is auto
, where the model dynamically chooses between low
and high
based on the image input size. In low
mode, a 512px x 512px low-res version of the image is received, utilizing a budget of 85 tokens for faster responses in scenarios not demanding high detail. Conversely, high
mode allows the model to examine the low-res image and generate detailed 512px squares, using a total token budget of 129 tokens for more intricate representations.
To optimize cost efficiency, the output from the model has been capped at a maximum of 100 characters. This restriction aligns with the token-based charging system of GPT, allowing users to receive concise and relevant responses while managing and predicting token consumption effectively.
Looking ahead, our AI block envisions a series of enhancements to broaden its utility. One key area of focus is extending support to an offline plugin, enabling users to harness AI capabilities even without an active internet connection. Additionally, we aspire to diversify the file formats handled by the AI block, including but not limited to PDFs, videos, and signature blocks. These expansions aim to provide users with a more comprehensive and versatile AI experience, catering to a wider array of data types and user scenarios.
This blog post was originally published here.
140L, 5th Main Rd, Sector 6, HSR Layout, Bengaluru, Karnataka 560102, India
+91 96418 61031
3500 S DuPont Hwy, Dover,
Kent 19901, Delaware, USA
+1 (341) 209-1116
3500 S DuPont Hwy, Dover,
Kent 19901, Delaware, USA
+1 (341) 209-1116
140L, 5th Main Rd, Sector 6, HSR Layout, Bengaluru, Karnataka 560102, India
+91 96418 61031