ML Sys Design: How AI Tags New Apps Before Anyone Installs Them

Solving the App Discovery Challenge: How AI Tags New Apps Before Anyone Installs Them

When a brand-new app hits the platform, how do users find it? With no installs, ratings, or engagement data, recommendation systems are flying blind. This is the cold-start discovery challenge.

Our solution is a multimodal AI system that automatically generates high-quality, descriptive tags—like “photo editor” or “turn-based strategy”—the moment an app is submitted, making discovery possible from day one. It combines Large Language Models (LLMs) and computer vision so that new apps are accurately categorized and can be ranked effectively.


Stage 1: Problem Understanding

The core challenge is the cold-start discovery issue for newly submitted apps. Because new apps lack behavioral signals, it is nearly impossible for search and recommendation systems to rank them.

Our goal is to automatically generate descriptive tags that summarize the app’s function. To achieve this, we combine several data modalities: textual metadata (developer input), visual cues (screenshots and icons), and light feedback from early users or editors. These generated tags should improve discoverability by helping the ranking system understand the app’s semantic category and user intent.

The system input includes the developer’s title, description, keywords, and screenshots, potentially localized in multiple languages. The required output is a set of meaningful, policy-compliant tags, each with an associated confidence score. Crucially, the system must be multilingual, privacy-preserving, and fast, capable of tagging an app asynchronously within a few seconds of submission.
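
To make this contract concrete, here is a minimal sketch of the input and output structures in Python; the field names are illustrative assumptions, not the production schema.

```python
from dataclasses import dataclass, field

@dataclass
class AppSubmission:
    """What the developer provides at submission time (illustrative fields)."""
    app_id: str
    title: str
    description: str
    keywords: list[str]
    screenshot_paths: list[str]
    locale: str = "en"

@dataclass
class GeneratedTag:
    tag_id: str        # must come from the curated taxonomy, e.g. "photo_editor"
    confidence: float  # calibrated score in [0, 1]
    source: str        # "text", "vision", or "fused"

@dataclass
class TaggingResult:
    app_id: str
    tags: list[GeneratedTag] = field(default_factory=list)
```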

In essence, we are designing a robust multimodal tagging system that leverages LLMs, vision models, and feedback loops to label new apps, boosting their cold-start search ranking while prioritizing precision and trustworthiness.


Stage 2: High-Level Design

The AI pipeline kicks off the moment an app is submitted or updated.

  1. Ingestion & Preprocessing: Metadata and screenshots are ingested. Text is normalized, language-identified, and potentially translated into a pivot language (like English). Sensitive or personally identifiable information is carefully stripped out. The remaining content is segmented into defined fields (e.g., title, description).
  2. Vision Pipeline: Screenshots and icons are processed. A vision pipeline resizes the images, extracts embeddings using a CLIP-style model, and applies OCR (Optical Character Recognition) to capture any visible text within the app’s interface.
  3. Feature Store: The results from both text and vision modalities are written to a multimodal feature store. An ANN (approximate nearest neighbor) index over these embeddings lets us retrieve candidate tags from similar existing apps (a retrieval sketch follows this list).
  4. Ensemble Tagging Core: The heart of the system is an ensemble of tagging models.
    • Text-Centric LLM Tagger: Reads preprocessed metadata and generates tags by reasoning over the content. This LLM is instruction-tuned to output structured tag sets instead of free text and uses constrained decoding to adhere to a curated taxonomy.
    • Complementary Vision Tagger: Uses a vision transformer or CLIP model to analyze screenshots and predict likely tags based on the app’s visual layout and content.
  5. Fusion and Filtering: A fusion layer combines the outputs of both models, weighting them by confidence. Before the tags are stored, a policy filter removes any tag that violates region-specific rules or sensitive-topic guidelines (a minimal fusion sketch also follows this list).
  6. Feedback Loop: The results are written to a tag store, accessible by ranking services. The loop closes when editors verify or adjust tags, and early user interactions (e.g., specific search queries leading to installs) provide crucial weak signals for retraining.

The entire flow can be summarized as: ingest → preprocess → multimodal tagging → policy filter → tag store → feedback and monitoring.
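
The nearest-neighbor retrieval step can be sketched as follows, assuming precomputed, L2-normalized multimodal embeddings and a flat faiss index for clarity (an IVF or HNSW index would replace it at scale); `catalog_embeddings` and `catalog_tags` are placeholders for the existing catalog.

```python
import numpy as np
import faiss

def build_tag_retriever(catalog_embeddings: np.ndarray, catalog_tags: list[set]):
    """catalog_embeddings: (n_apps, dim) L2-normalized; catalog_tags[i]: tags of catalog app i."""
    index = faiss.IndexFlatIP(catalog_embeddings.shape[1])  # exact inner-product search
    index.add(np.ascontiguousarray(catalog_embeddings, dtype="float32"))

    def neighbor_tag_candidates(app_embedding: np.ndarray, k: int = 20) -> dict:
        """Return tag -> aggregated similarity score from the k nearest catalog apps."""
        query = np.ascontiguousarray(app_embedding, dtype="float32").reshape(1, -1)
        scores, ids = index.search(query, k)
        candidates: dict = {}
        for score, idx in zip(scores[0], ids[0]):
            for tag in catalog_tags[idx]:
                candidates[tag] = candidates.get(tag, 0.0) + float(score)
        total = sum(candidates.values()) or 1.0  # normalize to rough confidences
        return {tag: s / total for tag, s in candidates.items()}

    return neighbor_tag_candidates
```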
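And a minimal sketch of the fusion and policy-filter step, assuming both taggers emit {tag_id: confidence} maps; the weights, threshold, and policy table are illustrative and would be tuned or learned in practice.

```python
# Illustrative weights, threshold, and policy table; in practice these are tuned or learned.
TEXT_WEIGHT, VISION_WEIGHT = 0.7, 0.3
BLOCKED_TAGS_BY_REGION = {"xx": {"gambling"}}  # placeholder for region-specific policy rules

def fuse_and_filter(text_tags: dict, vision_tags: dict, region: str,
                    threshold: float = 0.6) -> dict:
    """Confidence-weighted fusion of the two taggers, followed by a policy filter."""
    fused = {}
    for tag in set(text_tags) | set(vision_tags):
        fused[tag] = (TEXT_WEIGHT * text_tags.get(tag, 0.0)
                      + VISION_WEIGHT * vision_tags.get(tag, 0.0))
    blocked = BLOCKED_TAGS_BY_REGION.get(region, set())
    return {tag: score for tag, score in fused.items()
            if score >= threshold and tag not in blocked}
```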


Stage 3: Data Considerations

Data quality is critical; the reliability of the model hinges on how representative its training signals are.

  • Ground Truth: The primary supervised training data comes from existing mature apps whose tags have been verified by editors or derived from robust developer metadata.
  • Weak Signals: We derive additional weak labels from user behavior (e.g., if many users search for “habit tracker” and then install the app, we treat “habit tracker” as a candidate weak label; see the sketch after this list). Textual context from OCR (screenshots) and ASR (trailers) also provides vital interface clues, such as “XP,” “quests,” or “flashcards”.
  • Features: For each app, we collect text features (titles, keywords, OCR tokens), visual embeddings (screenshot and icon representations), and structural metadata (age rating, SDKs used, category).
  • Multilingual Consistency: Multilingual normalization is required to map all content into a consistent semantic space, often involving domain-specific glossaries or multilingual encoders.
  • Handling Sparsity: Since missing or incomplete data (e.g., no screenshots or minimal descriptions) is inevitable, the system must rely more heavily on available modalities and explicitly lower confidence scores to prevent unreliable over-tagging.
  • Preventing Leakage: The dataset must be split temporally rather than randomly. This ensures the model is predicting tags for genuinely new apps that were not part of its training history.
  • Addressing Imbalance: Since popular tags are frequent and niche tags (like “birding guide”) are rare, we address this imbalance using techniques such as focal loss, class reweighting, and positive-unlabeled learning.
  • Validation: Establishing a small, high-quality “editor-gold” validation set (approximately 10,000 manually curated apps) is crucial for tuning confidence thresholds and achieving the optimal precision/recall trade-offs before launch.
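
The search-to-install weak labels mentioned above can be derived with a simple aggregation. Here is a sketch, assuming a hypothetical log of (query, app_id, installed) rows and a query-to-tag mapping; the thresholds are illustrative.

```python
import pandas as pd

def weak_labels_from_search_logs(logs: pd.DataFrame, query_to_tag: dict,
                                 min_installs: int = 50, min_rate: float = 0.05) -> pd.DataFrame:
    """Return (app_id, tag) pairs supported by enough install evidence."""
    logs = logs.assign(tag=logs["query"].map(query_to_tag)).dropna(subset=["tag"])
    stats = (logs.groupby(["app_id", "tag"])["installed"]
                 .agg(installs="sum", impressions="count")
                 .reset_index())
    stats["install_rate"] = stats["installs"] / stats["impressions"]
    keep = (stats["installs"] >= min_installs) & (stats["install_rate"] >= min_rate)
    return stats.loc[keep, ["app_id", "tag"]]
```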

Stage 4: Modeling, Metrics, and Training

Conceptually, the task is a combination of multilabel classification and controlled generation. The model must predict a subset of valid categories, as each app can have several tags.

  • Textual Modeling: An instruction-tuned LLM is highly effective here because it can reason over free-form descriptions. It is prompted with structured metadata and the taxonomy schema, using constrained decoding to ensure it only outputs valid tag IDs (a simplified sketch follows this list).
  • Efficiency Strategy (Distillation): To ensure the system is efficient for high-volume re-tagging, the powerful LLM’s knowledge can be distilled into a smaller, faster text encoder (like a BERT-style model).
  • Visual Modeling: We use a CLIP or ViT-based encoder fine-tuned on screenshots and icons. This model learns to correlate visual elements (e.g., shopping carts, musical notation) with specific app genres.
  • Training Objective: Training combines losses from both modalities: typically a balanced binary cross-entropy or focal loss for the tags, plus a contrastive alignment term that encourages tag embeddings to align with both text and vision features, and a coverage loss that penalizes redundant tags (a sketch of the combined loss appears after this list).
  • Offline Evaluation: Metrics include micro and macro F1 scores, precision at k, coverage, and calibration error. We also measure how much the generated tags improve search relevance in historical replay experiments, tracking metrics like nDCG@10 for new apps (see the metrics sketch after this list).
  • Online Evaluation: Once live, metrics expand to business outcomes, including click-through rates, install conversion, and user retention. Precision and safety are non-negotiable; open-world tags must clear strict thresholds.
  • Handling Challenges: Fallback strategies (like nearest-neighbor retrieval) are used for sparse metadata. Continuous retraining is necessary to mitigate temporal drift (as new app types emerge) and multilingual drift.
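
A simplified sketch of the text-centric tagger: true constrained decoding would restrict generation to taxonomy tokens, but the same guarantee can be approximated by validating the parsed output against the taxonomy. The `call_llm` argument stands in for whichever instruction-tuned model is used, and the taxonomy shown is a tiny illustrative subset.

```python
import json
from typing import Callable

TAXONOMY = {"photo_editor", "habit_tracker", "turn_based_strategy"}  # illustrative subset

PROMPT_TEMPLATE = (
    "You are an app-store tagger. Given the metadata below, return a JSON object of the "
    'form {{"tags": [{{"tag_id": "...", "confidence": 0.0}}]}} using ONLY tag_ids from '
    "this taxonomy: {taxonomy}.\n\nTitle: {title}\nDescription: {description}"
)

def tag_with_llm(title: str, description: str, call_llm: Callable[[str], str]) -> dict:
    """call_llm is any function that sends a prompt to the instruction-tuned model."""
    prompt = PROMPT_TEMPLATE.format(taxonomy=sorted(TAXONOMY),
                                    title=title, description=description)
    try:
        parsed = json.loads(call_llm(prompt))
    except json.JSONDecodeError:
        return {}  # let the vision and neighbor taggers carry this app
    return {t["tag_id"]: float(t["confidence"])
            for t in parsed.get("tags", [])
            if t.get("tag_id") in TAXONOMY}  # drop anything outside the taxonomy
```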
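A sketch of the combined training objective in PyTorch, assuming per-app fused embeddings and learnable tag embeddings; the coverage term is omitted for brevity, and the focal loss shown is the common simplified variant.

```python
import torch
import torch.nn.functional as F

def focal_bce(logits, targets, gamma=2.0, alpha=0.25):
    """Simplified focal variant of binary cross-entropy for multi-hot tag targets."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)  # probability assigned to the true label
    return (alpha * (1 - p_t) ** gamma * bce).mean()

def contrastive_alignment(app_emb, tag_emb, targets, temperature=0.07):
    """Pull each app's fused embedding toward the embeddings of its positive tags.
    app_emb: (B, d), tag_emb: (T, d), targets: (B, T) multi-hot."""
    app_emb = F.normalize(app_emb, dim=-1)
    tag_emb = F.normalize(tag_emb, dim=-1)
    log_probs = F.log_softmax(app_emb @ tag_emb.T / temperature, dim=-1)
    pos_mask = targets / targets.sum(dim=-1, keepdim=True).clamp(min=1.0)
    return -(pos_mask * log_probs).sum(dim=-1).mean()

def total_loss(logits, app_emb, tag_emb, targets, lam=0.3):
    return focal_bce(logits, targets) + lam * contrastive_alignment(app_emb, tag_emb, targets)
```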
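And a sketch of the core offline metrics on the editor-gold set, assuming y_true and y_score are (n_apps, n_tags) arrays; coverage and calibration error are omitted here for brevity.

```python
import numpy as np
from sklearn.metrics import f1_score

def offline_metrics(y_true, y_score, threshold=0.5, k=5):
    y_pred = (y_score >= threshold).astype(int)
    micro_f1 = f1_score(y_true, y_pred, average="micro", zero_division=0)
    macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
    # precision@k: fraction of the top-k predicted tags per app that are correct
    topk = np.argsort(-y_score, axis=1)[:, :k]
    precision_at_k = np.take_along_axis(y_true, topk, axis=1).mean()
    return {"micro_f1": micro_f1, "macro_f1": macro_f1, f"precision@{k}": precision_at_k}
```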

Stage 5: Productionization, Trade-offs, and Deployment

In a production environment, tagging runs primarily in batch mode. When an app is submitted or updated, a background pipeline asynchronously processes it through the ensemble models, writing the resulting tags into a dedicated tag store. Downstream ranking systems can then read these precomputed tags in real time (within milliseconds) from a feature store to compute relevance scores.

  • MLOps and Monitoring: Model management adheres to standard MLOps practices, involving registration, canary testing, and A/B experiments.
  • Prioritizing Precision: Precision is the single most important KPI because inaccurate or unsafe tags can have legal implications. Rule-based overrides provide necessary quick patches while awaiting comprehensive retraining.
  • Key Trade-offs:
    • Precision vs. Recall: In early deployment, we err on the side of precision, only generating tags the system is highly confident in, even if it sacrifices some recall.
    • Latency vs. Cost: The computationally heavy LLM is used only during batch processing; high-speed downstream serving relies on precomputed embeddings.
    • Personalization vs. Privacy: Any future incorporation of personalization signals must ensure data remains anonymized or on-device to protect privacy.

Monitoring infrastructure constantly tracks input data drift, output tag distribution, and tag coverage. Weekly retrains and nightly calibration maintain freshness, while feedback loops continually refine the model’s decision boundaries.
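
As one concrete example of drift monitoring, the output tag distribution can be compared week over week with a population stability index (PSI); this sketch assumes tag counts aggregated per time window.

```python
import numpy as np

def tag_distribution_psi(reference_counts: dict, current_counts: dict, eps: float = 1e-6) -> float:
    """PSI over tag frequencies; values above ~0.2 are a common rule-of-thumb alert level."""
    tags = sorted(set(reference_counts) | set(current_counts))
    ref = np.array([reference_counts.get(t, 0) for t in tags], dtype=float) + eps
    cur = np.array([current_counts.get(t, 0) for t in tags], dtype=float) + eps
    ref, cur = ref / ref.sum(), cur / cur.sum()
    return float(np.sum((cur - ref) * np.log(cur / ref)))
```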

The overall production mantra is safe automation with human oversight, allowing the system to scale to millions of apps while continually learning from real-world signals.


Stage 6: Strategic Questions for Future Development

To solidify the system’s deployment plan, two high-leverage questions are critical to ask internal stakeholders:

  1. Taxonomy Governance: How fixed is the underlying taxonomy? Does the model operate within a stable, curated list of tags, or is it expected to suggest new candidate tags over time? The answer determines whether the system focuses on strict constrained classification or open-ended generation.
  2. Risk Tolerance: What are the acceptable precision and recall trade-offs for launch? Knowing the team’s tolerance for false positives versus false negatives helps set decision thresholds and determines how aggressively new, “open-world” tags can be introduced.

These questions demonstrate an essential awareness of real-world constraints—like taxonomy governance and risk tolerance—that are vital for moving an ML research prototype into a successful production system.


This multimodal tagging system acts like a hyper-efficient librarian for new apps. Instead of waiting for millions of people to read the book (install the app) and report back on its category, the system scans the cover, title page, and internal structure (metadata and screenshots) immediately upon arrival, ensuring it gets placed on the correct shelf right away so the perfect reader can find it.



