White Paper

Towards Explainable AI in STEM Education

July 11, 2025 · 41 min read

Abdullah

Abstract

This paper proposes a novel architecture that integrates Manim, a Python library for programmatic animation, with Large Language Models (LLMs) to generate dynamic visualizations and natural language explanations for complex STEM concepts. We address the critical need for explainable AI in education by developing a system that not only provides accurate explanations but also visually demonstrates underlying principles, thereby enhancing comprehension and engagement. Our proposed Explainable STEM Concept Generator (ESCG) leverages an agentic framework, incorporating modules for natural language understanding, knowledge graph reasoning, Manim code generation with integrated LaTeX, and a visual feedback loop inspired by Joint Embedding Predictive Architecture (JEPA) principles. We discuss the challenges of current Manim-LLM integrations, such as prompt dependency and the lack of visual understanding, and present a comprehensive solution designed for scalability and generalizability across diverse STEM disciplines. This work aims to bridge the gap between textual explanations and visual demonstrations, offering a more intuitive and effective learning experience for students and educators.

1. Introduction

1.1. Background: The Growing Need for Explainable AI in STEM Education

The rapid advancements in Artificial Intelligence (AI) have ushered in an era where intelligent systems are increasingly integrated into various aspects of human life, including education. In Science, Technology, Engineering, and Mathematics (STEM) fields, AI holds immense promise for revolutionizing learning by offering personalized instruction, automated assessment, and access to vast knowledge bases. However, the complexity and often opaque nature of advanced AI models, particularly Large Language Models (LLMs), present a significant challenge: the lack of explainability. For AI to be truly effective as an educational tool, especially in foundational and complex STEM subjects, it must not only provide correct answers but also elucidate the reasoning behind those answers. This need for Explainable AI (XAI) in education is paramount, as it fosters deeper understanding, critical thinking, and trust in AI-driven learning systems [1]. Without transparent explanations, students may merely memorize facts without grasping the underlying principles, hindering genuine intellectual development.

1.2. Challenges in Traditional STEM Education and the Role of Visualization

Traditional STEM education often grapples with inherent difficulties in conveying abstract and intricate concepts. Many fundamental ideas in mathematics, physics, chemistry, and computer science are inherently visual or dynamic, yet conventional teaching methods frequently rely on static diagrams, textual descriptions, or abstract equations. This can lead to significant cognitive load for learners, making it challenging to build intuitive mental models of complex phenomena. For instance, understanding vector fields, quantum mechanics, or advanced algorithms often requires visualizing processes that unfold over time or in multi-dimensional spaces. Visualizations play a crucial role in overcoming these challenges by transforming abstract data into perceptible forms, thereby facilitating pattern recognition, insight generation, and knowledge construction [2]. Dynamic visualizations, in particular, can illustrate causality, demonstrate transformations, and simulate real-world processes, offering a powerful complement to static representations.

1.3. Introduction to Manim: A Tool for Programmatic Animation

Manim (Mathematical Animation Engine) is a free and open-source Python library created by Grant Sanderson for programmatically creating precise, high-quality animations. Initially developed for the popular YouTube channel 3Blue1Brown, Manim has gained widespread recognition for its ability to produce visually stunning and mathematically accurate educational videos. Unlike traditional animation software, Manim allows users to define scenes and objects using Python code, offering unparalleled control over every aspect of the animation, from the precise placement of mathematical symbols to the smooth interpolation of complex transformations. This programmatic approach ensures reproducibility, facilitates rapid iteration, and enables the creation of highly customized visualizations that are perfectly aligned with specific educational objectives. Its utility extends beyond pure mathematics, finding applications in physics, computer science, and other STEM disciplines where dynamic visual explanations are beneficial [3].
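
To ground this description, the following minimal Manim Community Edition scene sketches the programmatic style: objects are instantiated in Python and animations are played explicitly. The class, shapes, and formula here are purely illustrative placeholders, not output of the system described later in this paper.

from manim import *

class CircleToSquare(Scene):
    def construct(self):
        circle = Circle(color=BLUE)                 # create a circle
        square = Square(color=GREEN)                # create a square
        equation = MathTex(r"e^{i\pi} + 1 = 0")     # LaTeX-typeset formula
        self.play(Create(circle))                   # animate drawing the circle
        self.play(Transform(circle, square))        # morph the circle into the square
        self.play(Write(equation.next_to(square, DOWN)))  # write the formula below it
        self.wait(1)

Rendering such a scene with the Manim CLI (for example, manim -ql scene.py CircleToSquare) produces the video file directly from this code.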

1.4. Introduction to Large Language Models (LLMs) and Their Potential in Education

Large Language Models (LLMs) are a class of AI models trained on vast datasets of text and code, enabling them to understand, generate, and interact with human language with remarkable fluency and coherence. Models such as GPT-4, Gemini, and Claude have demonstrated capabilities ranging from answering complex questions and summarizing documents to writing creative content and generating code. In the educational landscape, LLMs offer transformative potential. They can act as intelligent tutors, providing instant feedback, explaining concepts in multiple ways, and generating practice problems. They can also assist educators in content creation, curriculum development, and personalized learning path design. Their ability to process and synthesize information from diverse sources makes them powerful tools for knowledge dissemination and comprehension [4].

1.5. Problem Statement: Bridging the Gap Between LLM Explanations and Visualizations

Despite the individual strengths of LLMs in generating textual explanations and Manim in creating dynamic visualizations, a significant gap exists in their seamless integration for comprehensive STEM education. Current LLMs, while adept at producing verbose and seemingly coherent explanations, often lack a true understanding of the visual and spatial implications of the concepts they describe. This limitation becomes apparent when LLMs are tasked with generating Manim code; the resulting animations may suffer from technical errors, visual inconsistencies, or a lack of pedagogical effectiveness, even if the underlying code is syntactically correct [5]. Conversely, Manim, while powerful, requires significant programming expertise to create complex animations, posing a barrier for many educators and students. The challenge lies in developing a system that can leverage the explanatory power of LLMs and the visual precision of Manim, ensuring that the generated explanations are not only linguistically accurate but also visually coherent, pedagogically sound, and dynamically illustrative of the STEM concepts.

1.6. Our Contribution: A Novel Manim-LLM Architecture

This paper proposes a novel, integrated architecture, the Explainable STEM Concept Generator (ESCG), designed to bridge the aforementioned gap. Our primary contribution is a multi-module system that intelligently orchestrates the capabilities of LLMs and Manim to produce high-quality, dynamic visualizations coupled with natural language explanations for STEM concepts. The ESCG features:

  • Intelligent Scene Planning: An LLM-driven module that translates high-level STEM concepts into detailed visual storyboards and animation plans.
  • Robust Manim Code Generation: A specialized module that generates precise and pedagogically effective Manim scripts, incorporating advanced features like LaTeX integration for mathematical expressions.
  • Integrated Visual Feedback Loop: A novel component, inspired by JEPA principles, that allows the system to evaluate and refine the generated Manim code based on visual coherence and correctness, addressing the LLM's inherent lack of visual understanding.

  • Coherent Natural Language Explanation: A module that generates clear, concise, and pedagogically sound natural language explanations that are synchronized with the visual animations.
  • Generalizability: The architecture is designed to be adaptable across various STEM disciplines, moving beyond domain-specific solutions.

By integrating these components, the ESCG aims to provide a comprehensive and intuitive learning experience, making complex STEM concepts more accessible and engaging for a wider audience.

1.7. Paper Organization

The remainder of this paper is organized as follows: Section 2 provides an overview of related work in Manim-based content generation, LLMs in education, and multimodal AI for explanation. Section 3 details the proposed Explainable STEM Concept Generator (ESCG) architecture, outlining each module and its functionalities. Section 4 presents the implementation details, including technologies used and illustrative code snippets. Section 5 discusses the evaluation methodology, experimental results, and a critical analysis of the system's performance and limitations. Finally, Section 6 concludes the paper with a summary of our contributions and outlines future research directions.

2. Related Work

2.1. Manim-based Educational Content Generation

Manim has emerged as a powerful tool for creating high-quality, mathematically accurate animations, primarily driven by its open-source nature and the influential work of 3Blue1Brown. Its programmatic approach allows for precise control over visual elements, making it ideal for illustrating complex STEM concepts. Numerous projects and initiatives have leveraged Manim to produce educational content across various disciplines. For instance, many YouTube channels and online courses utilize Manim to explain topics ranging from calculus and linear algebra to quantum mechanics and computer science algorithms. The Manim Community itself provides extensive documentation and examples, fostering a vibrant ecosystem of content creators [3].

While Manim excels in generating visually appealing animations, its primary limitation lies in the technical expertise required to wield it effectively. Creating even moderately complex animations necessitates proficiency in Python programming and a deep understanding of Manim's API. This steep learning curve often restricts its use to individuals with strong coding backgrounds, limiting its accessibility for educators who may lack such specialized skills. Furthermore, the process of translating abstract mathematical or scientific ideas into concrete Manim code can be time-consuming and iterative, involving significant manual effort in scripting and debugging. Despite these challenges, Manim remains a gold standard for programmatic visualization in STEM education due to its precision and aesthetic quality.

2.2. LLMs for Code Generation and Scripting

The advent of Large Language Models (LLMs) has revolutionized the field of code generation, demonstrating remarkable capabilities in translating natural language prompts into executable code across various programming languages. Models such as OpenAI's Codex, DeepMind's AlphaCode, and Meta's Code Llama have showcased the potential of LLMs to assist developers, automate routine coding tasks, and even generate novel algorithms. These models are typically trained on massive datasets of source code and natural language, enabling them to learn intricate programming patterns, syntax, and common idioms. In the context of Manim, LLMs have been explored for generating animation scripts from textual descriptions, aiming to democratize the creation of educational content [5].

However, the application of LLMs to Manim code generation is not without its challenges. While LLMs can produce syntactically correct Python code, they often struggle with the semantic and visual correctness required for effective animations. This includes issues such as incorrect object placement, illogical animation sequences, or failure to adhere to pedagogical principles. The inherent lack of visual understanding in text-based LLMs means they cannot “see” the output of their generated code, leading to a trial-and-error process that can be inefficient. Furthermore, the quality of the generated code is highly dependent on the specificity and clarity of the input prompts, often requiring extensive prompt engineering to achieve desirable results [6].

2.3. LLMs in Educational Contexts (e.g., tutoring, content creation)

Beyond code generation, LLMs are increasingly being explored for their broader applications in educational contexts. Their ability to engage in natural language dialogue makes them suitable for roles such as intelligent tutors, providing personalized explanations, answering student queries, and offering feedback on assignments. LLMs can adapt to individual learning paces and styles, offering tailored support that traditional methods often cannot. They can also assist educators in content creation, generating lesson plans, quizzes, and summaries of complex topics. The potential for LLMs to democratize access to high-quality education, particularly in underserved regions, is immense [4].

However, the deployment of LLMs in education raises several pedagogical and ethical concerns. Issues such as factual accuracy, potential for hallucination, bias in training data, and the risk of over-reliance on AI for critical thinking development need careful consideration. While LLMs can provide information, their explanations may sometimes lack the depth, nuance, or contextual understanding that a human educator provides. Ensuring that LLM-generated content is pedagogically sound and promotes genuine learning rather than rote memorization is a critical challenge.

2.4. Multimodal AI for Explanation and Visualization

Multimodal AI, which integrates and processes information from multiple modalities such as text, images, audio, and video, offers a promising avenue for enhancing explanations and visualizations in STEM education. By combining the strengths of different modalities, multimodal systems can provide richer, more comprehensive, and more engaging learning experiences. For instance, a system that can generate both textual explanations and corresponding visual animations can cater to diverse learning styles and improve comprehension of complex concepts. The synergy between modalities can also help in disambiguating information and reinforcing understanding [7].

Recent advancements in multimodal LLMs, which can process and generate content across text and image domains, are particularly relevant. These models can potentially bridge the gap between abstract textual descriptions and concrete visual representations, enabling more intuitive and effective communication of STEM concepts. However, developing truly integrated multimodal AI systems that can seamlessly generate coherent and pedagogically effective explanations and visualizations remains an active area of research.

2.5. Existing Approaches Combining LLMs and Manim

Several pioneering projects have attempted to combine the power of LLMs with Manim to automate the creation of educational content. These initiatives highlight both the potential and the challenges of this interdisciplinary approach:

  • TheoremExplainAgent (TEA) [8]: This agentic system employs a two-agent architecture—a planner agent and a coding agent—to generate long-form Manim videos explaining mathematical theorems. The planner agent creates story plans and narrations, while the coding agent translates these into Manim scripts. TEA also introduces a benchmark for evaluating multimodal theorem explanations. Its strength lies in its agentic workflow, which helps in maintaining coherence over longer video durations.

  • HarleyCoops/Math-To-Manim [9]: This project focuses on leveraging various LLMs (e.g., DeepSeek AI, Google Gemini) to generate Manim animations from highly detailed, LaTeX-rich prompts. It emphasizes the importance of meticulous prompt engineering and the ability to generate simultaneous LaTeX study notes alongside the animations. The project demonstrates the potential for LLMs to handle complex mathematical visualizations when provided with precise instructions.

  • EduWiz [10]: Developed by students at the University of Toronto, EduWiz utilizes LangChain with OpenAI models to generate dynamic Manim animations for teaching STEM concepts. This platform aims to create on-demand animated tutorial videos, focusing on interactive and personalized learning experiences. It showcases the application of LLM orchestration frameworks like LangChain for educational content generation.

  • makefinks/manim-generator [11]: This project implements a code-writer and code-reviewer feedback loop to improve the quality of LLM-generated Manim code. It addresses the inherent error-proneness of LLM-generated code by introducing a review mechanism, and also explores the use of Vision Language Models (VLMs) to provide visual feedback, which is a crucial step towards addressing the LLM's lack of visual understanding.

2.6. Identified Gaps and Limitations in Current Research

While the existing approaches demonstrate significant progress, several critical gaps and limitations persist, which our proposed architecture aims to address:

  1. Lack of Unified Visual-Semantic Understanding: Current LLMs, even multimodal ones, struggle with a deep, unified understanding of both the semantic content of STEM concepts and their visual representation. This often leads to animations that are either semantically correct but visually unappealing/ineffective, or vice-versa. The integration of visual feedback mechanisms is nascent and often heuristic-based.
  2. Over-reliance on Explicit Prompt Engineering: Many existing systems require highly detailed and structured prompts, often in specific formats like LaTeX, to generate satisfactory Manim code. This limits accessibility for users without specialized knowledge and hinders natural language interaction.
  3. Limited Robustness and Error Handling: LLM-generated Manim code is prone to errors (syntax, runtime, or logical). While some projects implement feedback loops, a more robust and intelligent error diagnosis and self-correction mechanism, particularly one that incorporates visual error detection, is needed.
  4. Scalability and Generalizability Across STEM Domains: Most projects tend to focus on specific mathematical or physics concepts. A truly effective system needs to be generalizable across a broader range of STEM disciplines, requiring a more abstract and flexible knowledge representation.
  5. Absence of Comprehensive Pedagogical Integration: While animations are generated, the pedagogical effectiveness—how well the explanation aids learning, addresses misconceptions, and adapts to different learning styles—is often an afterthought rather than an integrated design principle.

Our proposed Explainable STEM Concept Generator (ESCG) architecture directly tackles these limitations by introducing a more sophisticated interplay between natural language understanding, knowledge representation, Manim code generation, and a novel visual feedback and refinement loop, thereby paving the way for more intuitive, accurate, and pedagogically effective AI-driven STEM education tools.

3. Proposed System Architecture: Explainable STEM Concept Generator (ESCG)

To address the identified limitations in current Manim-LLM integrations for STEM education, we propose the Explainable STEM Concept Generator (ESCG), a novel, modular, and agentic architecture designed to produce high-quality, dynamic visualizations coupled with natural language explanations from high-level user queries. The ESCG aims to bridge the gap between abstract STEM concepts and their intuitive visual and textual representations, fostering deeper understanding and engagement. Figure 1 illustrates the overall architecture of the ESCG.

3.1. Overview of the ESCG Architecture

The ESCG operates as a multi-agent system, where specialized modules collaborate to transform a user's natural language query into a synchronized Manim animation and a coherent natural language explanation. The core components of the ESCG include:

  1. Natural Language Understanding (NLU) and Intent Recognition Module: Responsible for parsing user queries, identifying key STEM concepts, and determining the user's intent (e.g., explanation, demonstration, problem-solving).
  2. Knowledge Graph and Semantic Reasoning Module: Acts as the central knowledge base, storing structured information about STEM concepts, their relationships, and relevant pedagogical approaches. It enables the system to perform semantic reasoning and retrieve pertinent information for explanation and visualization.
  3. Manim Code Generation Module: Comprising a Scene Planner and a Manim Script Generator, this module translates the semantic understanding into a visual storyboard and subsequently into executable Manim code, including LaTeX for mathematical expressions.
  4. Natural Language Explanation Generation Module: Responsible for crafting clear, concise, and pedagogically sound textual explanations that are synchronized with the generated Manim animations.
  5. Visual Feedback and Refinement Loop: A critical component that evaluates the visual output of the Manim code, identifies potential errors or sub-optimal visualizations, and provides feedback for iterative refinement, drawing inspiration from JEPA principles.
  6. Multimodal Output Synthesis: Integrates the Manim animation and the natural language explanation into a cohesive, synchronized learning experience.

Each module is designed to be interoperable, allowing for a flexible and extensible architecture that can adapt to new STEM domains and evolving pedagogical needs.

3.2. Natural Language Understanding and Intent Recognition Module

The NLU and Intent Recognition Module serves as the primary interface between the user and the ESCG. Its main functions are:

  • Query Parsing: Decomposing complex natural language queries into their constituent parts, identifying keywords, entities (e.g., 'force', 'derivative', 'DNA'), and relationships.
  • Concept Extraction: Identifying the core STEM concepts the user wishes to understand or visualize. This involves mapping natural language terms to predefined concepts within the Knowledge Graph.
  • Intent Classification: Determining the user's underlying goal. For example, a query like "Show me how a derivative works" indicates a request for a visual demonstration, while "Explain the concept of entropy" suggests a need for a textual explanation, possibly with illustrative examples.
  • Contextualization: Understanding the context of the query, including any prior interactions or specific learning objectives. This allows for more personalized and relevant responses.

This module leverages advanced LLM capabilities for semantic parsing and intent recognition, potentially fine-tuned on a dataset of STEM-related queries and their corresponding intents. The output of this module is a structured representation of the user's request, which is then passed to the Knowledge Graph and Semantic Reasoning Module.

3.3. Knowledge Graph and Semantic Reasoning Module

The Knowledge Graph (KG) and Semantic Reasoning Module is the central repository of STEM knowledge within the ESCG. It is designed to represent complex concepts, their interconnections, and relevant pedagogical strategies in a structured, machine-readable format. Key aspects include:

  • Ontology Development: Defining an ontology that captures the hierarchical relationships between STEM concepts (e.g., 'calculus' contains 'derivatives', 'derivatives' are related to 'rates of change'). This ontology also includes properties and attributes for each concept (e.g., 'derivative' has 'definition', 'geometric interpretation', 'applications').
  • Fact and Rule Storage: Storing factual information (e.g., formulas, laws, theorems) and pedagogical rules (e.g., 'when explaining concept X, always start with analogy Y').
  • Semantic Reasoning: Performing inference over the KG to retrieve relevant information, identify prerequisite concepts, and determine the most effective way to explain or visualize a given topic. For instance, if a user asks about 'quantum entanglement,' the KG can identify related concepts like 'superposition' and 'quantum mechanics' and suggest a pedagogical path.
  • Dynamic Knowledge Update: The KG can be dynamically updated and expanded, potentially through automated information extraction from scientific literature or expert input, ensuring that the system's knowledge base remains current and comprehensive.

This module acts as the 'brain' of the ESCG, providing the necessary context and factual basis for both Manim code generation and natural language explanation.
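
To make this concrete, the sketch below shows prerequisite reasoning over a toy, dictionary-based concept graph; the concepts, edges, and function name are illustrative placeholders rather than the actual ESCG ontology or knowledge-graph backend.

# Minimal sketch of prerequisite reasoning over a toy concept graph.
from collections import deque

CONCEPT_GRAPH = {
    "quantum entanglement": {"prerequisites": ["superposition", "quantum mechanics"]},
    "superposition":        {"prerequisites": ["wave function"]},
    "quantum mechanics":    {"prerequisites": ["linear algebra", "probability"]},
    "wave function":        {"prerequisites": ["complex numbers"]},
}

def prerequisite_path(concept: str) -> list:
    """Breadth-first traversal returning a concept's direct and indirect prerequisites."""
    ordered, queue, seen = [], deque([concept]), {concept}
    while queue:
        current = queue.popleft()
        for prereq in CONCEPT_GRAPH.get(current, {}).get("prerequisites", []):
            if prereq not in seen:
                seen.add(prereq)
                ordered.append(prereq)
                queue.append(prereq)
    return ordered

print(prerequisite_path("quantum entanglement"))
# ['superposition', 'quantum mechanics', 'wave function', 'linear algebra', 'probability', 'complex numbers']

In the full system, a graph database such as Neo4j would replace the in-memory dictionary, but the traversal logic that identifies prerequisite concepts and suggests a pedagogical path is the same in spirit.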

3.4. Manim Code Generation Module

The Manim Code Generation Module is responsible for translating the abstract understanding of STEM concepts into concrete, executable Manim scripts. This module is further divided into two sub-components:

3.4.1. Scene Planning and Storyboarding

This sub-module, primarily driven by an LLM, takes the structured query from the NLU module and the retrieved knowledge from the KG to create a detailed visual storyboard. This storyboard outlines the sequence of animations, the objects to be displayed, their transformations, and the timing of each event. It acts as an intermediate representation, bridging the gap between high-level concepts and low-level Manim code. The scene planner considers pedagogical principles, aiming to break down complex ideas into digestible visual segments. For example, explaining a derivative might involve scenes for: (1) introducing a function graph, (2) illustrating the secant line, (3) showing the limit process as the secant approaches the tangent, and (4) visualizing the derivative as the slope of the tangent. The output is a structured plan that guides the subsequent code generation.
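
For the derivative example above, the Scene Planner might emit a storyboard similar to the following; the field names mirror the conceptual schema used in Section 4.3.1 and are assumptions of this sketch, not a fixed ESCG format.

# Hypothetical storyboard emitted by the Scene Planner for "derivative"
derivative_storyboard = [
    {
        "scene_id": "function_graph",
        "description": "Plot f(x) and highlight a point P on the curve.",
        "elements": ["axes", "graph_f", "point_P"],
        "animations": ["draw_axes", "plot_graph", "fade_in_point"],
    },
    {
        "scene_id": "secant_line",
        "description": "Draw the secant line through P and a nearby point Q.",
        "elements": ["point_Q", "secant_line"],
        "animations": ["fade_in_point", "draw_line"],
    },
    {
        "scene_id": "limit_process",
        "description": "Slide Q toward P so the secant approaches the tangent.",
        "elements": ["point_Q", "secant_line", "tangent_line"],
        "animations": ["animate_Q_to_P", "transform_secant_to_tangent"],
    },
    {
        "scene_id": "slope_as_derivative",
        "description": "Label the slope of the tangent at P as the derivative.",
        "elements": ["tangent_line", "slope_label"],
        "animations": ["write_label"],
    },
]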

3.4.2. Manim Script Generation (with LaTeX integration)

This sub-module takes the detailed scene plan and generates the actual Python code for Manim. It leverages LLMs that are specifically trained or fine-tuned on Manim's API and best practices. A crucial aspect is the seamless integration of LaTeX for rendering mathematical expressions. The LLM is guided to generate Manim code that correctly incorporates LaTeX syntax for equations, symbols, and formulas, ensuring high-quality visual presentation of mathematical content. This module also handles aspects like object instantiation, animation sequences, camera movements, and scene transitions. Error handling mechanisms are built-in to detect and attempt to correct common Manim-related errors, such as incorrect object types or animation parameters. The generated Manim script is then passed to the Visual Feedback and Refinement Loop.
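
One plausible way to realize this error handling is to render each generated script with the Manim command-line interface in a sandboxed subprocess and return any failure output to the LLM for revision. The sketch below assumes the generated code is written to a temporary file and that the scene class name is known; the function name is hypothetical.

import subprocess
import tempfile
from pathlib import Path

def try_render(manim_code: str, scene_name: str) -> tuple[bool, str]:
    """Render a generated script with the Manim CLI and capture any error output.
    Returns (success, stderr) so the caller can feed errors back to the LLM."""
    with tempfile.TemporaryDirectory() as tmp:
        script_path = Path(tmp) / "generated_scene.py"
        script_path.write_text(manim_code)
        result = subprocess.run(
            ["manim", "-ql", str(script_path), scene_name],  # -ql: low-quality, fast preview render
            capture_output=True,
            text=True,
            timeout=300,
        )
        return result.returncode == 0, result.stderr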

3.5. Natural Language Explanation Generation Module

Operating in parallel with the Manim Code Generation Module, the Natural Language Explanation Generation Module is responsible for producing clear, concise, and pedagogically sound textual explanations. This module also utilizes LLMs, but with a focus on linguistic coherence, factual accuracy, and explanatory depth. Key features include:

  • Content Generation: Crafting explanations based on the information retrieved from the Knowledge Graph, tailored to the user's intent and contextual understanding.
  • Synchronization with Visuals: Ensuring that the textual explanation is precisely synchronized with the Manim animation. This involves generating narration cues or timestamps that align with specific visual events, creating a cohesive multimodal experience. A minimal cue format is sketched after this list.
  • Adaptive Explanations: Adjusting the complexity and detail of the explanation based on the inferred user's prior knowledge and learning level. For instance, a beginner might receive a more simplified explanation with analogies, while an advanced learner might get a more rigorous, technical description.
  • Clarity and Conciseness: Focusing on generating explanations that are easy to understand, avoiding jargon where possible, and breaking down complex sentences into simpler structures.
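
A minimal cue format for the synchronization described above could pair narration segments with points on the animation timeline; the field names and timings below are assumptions of this sketch, not a fixed ESCG schema.

# Hypothetical narration cues aligned to the Manim timeline (times in seconds)
narration_cues = [
    {"start": 0.0, "end": 4.0,
     "text": "Here is a right-angled triangle with legs a and b and hypotenuse c."},
    {"start": 4.0, "end": 9.0,
     "text": "We construct a square on each side of the triangle."},
    {"start": 9.0, "end": 15.0,
     "text": "Rearranging the two smaller squares shows that their areas add up to the area of the largest square."},
]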

3.6. Visual Feedback and Refinement Loop (incorporating JEPA principles)

This is a critical and novel component of the ESCG, designed to address the LLM's inherent lack of visual understanding and improve the quality of generated Manim animations. Inspired by Joint Embedding Predictive Architecture (JEPA) principles [12], this loop provides an automated mechanism for evaluating and refining the visual output. The process involves:

  • Manim Execution and Frame Capture: The generated Manim script is executed in a sandboxed environment, and key frames or short video segments are captured as visual outputs.
  • Visual Feature Extraction: A Vision Transformer (ViT) or similar visual encoder extracts abstract visual features from the captured frames. This encoder is trained to understand spatial relationships, object properties, and animation dynamics relevant to Manim visualizations.
  • Comparison with Expected Visuals: The extracted visual features are compared against an expected visual representation, which is derived from the Scene Planning and Storyboarding sub-module. This comparison occurs in a shared embedding space, similar to JEPA, where both the Manim code (or its intermediate representation) and the visual output are mapped. The goal is to identify discrepancies between what was intended visually and what was actually generated; a minimal embedding-space comparison is sketched after this list.

  • Error Diagnosis and Feedback Generation: If discrepancies are detected, an LLM-based error diagnosis agent analyzes the visual features and the Manim code to pinpoint the source of the error (e.g., incorrect coordinates, wrong animation type, overlapping objects). It then generates specific, actionable feedback for the Manim Script Generation sub-module, guiding it to revise the code. This feedback can range from simple parameter adjustments to more complex structural changes.
  • Iterative Refinement: The Manim Script Generation sub-module receives this feedback and attempts to generate a revised Manim script. This iterative process continues until the visual output meets predefined quality metrics or a maximum number of iterations is reached. This feedback loop is crucial for overcoming the inherent limitations of LLMs in visual reasoning.
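
The comparison step can be sketched as follows. The embeddings here are random placeholders standing in for the outputs of the trained visual encoder and the storyboard representation, and the similarity threshold is an assumed hyperparameter.

import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def needs_refinement(plan_embedding: np.ndarray,
                     frame_embeddings: list[np.ndarray],
                     threshold: float = 0.8) -> bool:
    """Flag a scene for refinement when no rendered key frame is close enough,
    in the shared embedding space, to the planner's intended representation."""
    best = max(cosine_similarity(plan_embedding, f) for f in frame_embeddings)
    return best < threshold

# Placeholder embeddings; in the ESCG these would come from the storyboard
# encoder and the Vision Transformer applied to captured Manim frames.
plan_vec = np.random.rand(512)
frame_vecs = [np.random.rand(512) for _ in range(4)]
if needs_refinement(plan_vec, frame_vecs):
    print("Discrepancy detected: send diagnostic feedback to the script generator.")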

3.7. Multimodal Output Synthesis

The final stage of the ESCG is the Multimodal Output Synthesis, where the refined Manim animation and the synchronized natural language explanation are combined into a cohesive and engaging learning experience. This involves:

  • Video Rendering: The Manim script is rendered into a high-quality video file, incorporating all visual elements, animations, and LaTeX expressions.
  • Audio Narration: The natural language explanation is converted into speech using a high-quality Text-to-Speech (TTS) engine. The timing of the narration is precisely synchronized with the visual events in the Manim animation.
  • Interactive Interface: The final output can be presented through an interactive web interface, allowing users to control playback, pause, rewind, and potentially explore different aspects of the explanation or visualization. This interface can also display the textual explanation alongside the video, providing a comprehensive learning resource.

By seamlessly integrating visual and auditory modalities, the ESCG aims to create a powerful and intuitive learning environment that caters to diverse learning styles and maximizes comprehension of complex STEM concepts.
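
As one possible realization of the assembly step, the rendered animation and the synthesized narration track can be muxed with a standard ffmpeg invocation; the file names and codec choices below are assumptions of this sketch.

import subprocess

def mux_video_and_narration(video_path: str, audio_path: str, output_path: str) -> None:
    """Combine a rendered Manim video with a TTS narration track using ffmpeg."""
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", video_path,      # rendered Manim animation
            "-i", audio_path,      # synthesized narration
            "-c:v", "copy",        # keep the video stream as-is
            "-c:a", "aac",         # encode the narration as AAC
            "-shortest",           # stop at the shorter of the two streams
            output_path,
        ],
        check=True,
    )

# mux_video_and_narration("pythagorean_scene.mp4", "narration.mp3", "lesson.mp4")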

4. Implementation Details

This section delves into the practical implementation of the Explainable STEM Concept Generator (ESCG) architecture, outlining the key technologies employed and providing illustrative code snippets to demonstrate the interaction between different modules. Our implementation prioritizes modularity, scalability, and the effective integration of LLM capabilities with Manim for dynamic content generation.

4.1. Technologies Used

The ESCG is built upon a stack of open-source and proprietary technologies, carefully selected to optimize performance and functionality:

  • Manim: The core animation engine, providing the framework for programmatic creation of mathematical animations. We utilize the Manim Community Edition for its robust features and active development.
  • Large Language Models (LLMs): We leverage a combination of state-of-the-art LLMs for various tasks:
    • Google Gemini (or similar powerful LLM): For Natural Language Understanding, Intent Recognition, Semantic Reasoning, and high-level Scene Planning. Its advanced reasoning capabilities are crucial for interpreting complex STEM queries and generating coherent storyboards.
    • Fine-tuned Code Generation LLM (e.g., specialized version of Code Llama or GPT-4): For Manim Script Generation. This LLM is specifically trained or fine-tuned on a large corpus of Manim code and best practices to ensure high-quality, executable, and pedagogically effective scripts. This includes proficiency in generating LaTeX for mathematical expressions.
    • Google Gemini (or similar powerful LLM): For Natural Language Explanation Generation and Error Diagnosis within the Visual Feedback Loop. Its ability to generate clear and concise text, as well as analyze and provide feedback on code, is essential.
  • LangChain (or similar orchestration framework): To manage the complex interactions and data flow between different LLM calls and modules, facilitating an agentic workflow.
  • Knowledge Graph Database (e.g., Neo4j, RDF store): To store and query structured STEM knowledge, including concepts, relationships, formulas, and pedagogical rules. This provides a robust foundation for semantic reasoning.
  • Vision Transformer (ViT) (or similar visual encoder): For Visual Feature Extraction within the feedback loop. This model is trained to extract meaningful visual representations from Manim-rendered frames.
  • Text-to-Speech (TTS) Engine (e.g., Google Text-to-Speech API): To convert natural language explanations into high-quality audio narration for the final video output.
  • Python: The primary programming language for integrating all components and developing custom logic.

4.2. Data Preparation and Training (if applicable)

The effectiveness of the LLM components within the ESCG heavily relies on the quality and relevance of their training data. While foundational LLMs are pre-trained on vast datasets, specific fine-tuning or prompt engineering strategies are employed:

  • Manim Code Generation LLM: This LLM is fine-tuned on a curated dataset of Manim scripts paired with their corresponding natural language descriptions and visual outcomes. This dataset includes examples of correct Manim usage, common pitfalls, and pedagogical best practices. Data augmentation techniques are used to generate diverse animation scenarios and variations.
  • Knowledge Graph Population: The Knowledge Graph is populated with STEM concepts and relationships extracted from textbooks, scientific papers, and educational resources. This process can involve a combination of automated information extraction using LLMs and manual curation by subject matter experts.
  • Visual Feedback Loop Training: The Visual Feature Extraction model (ViT) is trained on a dataset of Manim-rendered frames, learning to identify key visual elements, their spatial relationships, and animation dynamics. The error diagnosis LLM is trained on examples of Manim code errors and their corresponding visual manifestations, enabling it to provide targeted feedback.

4.3. Code Snippets and Examples of Module Interaction

To illustrate the interaction between the ESCG modules, we provide simplified conceptual code snippets. These examples focus on the core logic and data flow, abstracting away the complexities of API calls and error handling for clarity.

4.3.1. Example: Natural Language to Manim Scene Description

This example demonstrates how a user's natural language query is processed by the NLU and Intent Recognition Module, and then used by the Scene Planning and Storyboarding sub-module to generate a structured scene description. This process involves identifying the core concept and breaking it down into visualizable steps.

Let's consider a user query: "Explain the Pythagorean theorem with an animation."

# Conceptual representation of the NLU and Intent Recognition Module
class NLUModule:
    def process_query(self, query: str) -> dict:
        # Simulate LLM processing to extract intent and concepts
        if "Pythagorean theorem" in query and "animation" in query:
            return {
                "concept": "Pythagorean Theorem",
                "intent": "visual_explanation",
                "details": {"focus": "geometric proof"}
            }
        return {"concept": None, "intent": "unknown"}
 
# Conceptual representation of the Scene Planning and Storyboarding sub-module
class ScenePlanner:
    def generate_storyboard(self, concept_info: dict, kg_data: dict) -> list:
        concept = concept_info["concept"]
        intent = concept_info["intent"]
 
        if concept == "Pythagorean Theorem" and intent == "visual_explanation":
            # Retrieve pedagogical steps from Knowledge Graph (simulated)
            pedagogical_steps = kg_data.get(concept, {}).get("pedagogical_steps", [])
 
            storyboard = []
            for step in pedagogical_steps:
                # LLM generates detailed visual plan for each step
                visual_plan = self._llm_generate_visual_plan(step)
                storyboard.append(visual_plan)
            return storyboard
        return []
 
    def _llm_generate_visual_plan(self, step_description: str) -> dict:
        # This would involve an LLM call to generate a detailed visual plan
        # based on the step_description and overall concept.
        # For demonstration, we'll return a simplified plan keyed on phrases
        # in the (lower-cased) step description.
        step = step_description.lower()
        if "right triangle" in step:
            return {
                "scene_id": "intro_triangle",
                "description": "Draw a right-angled triangle with sides a, b, and hypotenuse c.",
                "elements": ["right_triangle", "labels_a_b_c"],
                "animations": ["draw_shape", "fade_in_labels"]
            }
        elif "construct squares" in step:
            return {
                "scene_id": "squares_on_sides",
                "description": "Construct squares on each side of the triangle.",
                "elements": ["square_a", "square_b", "square_c"],
                "animations": ["draw_squares"]
            }
        elif "rearrange" in step:
            return {
                "scene_id": "rearrange_proof",
                "description": "Rearrange the squares to show area(a^2) + area(b^2) = area(c^2).",
                "elements": ["square_a", "square_b", "square_c"],
                "animations": ["move_and_rotate_squares", "show_area_equality"]
            }
        return {"scene_id": "", "description": "", "elements": [], "animations": []}
 
# Simulated Knowledge Graph data
knowledge_graph_data = {
    "Pythagorean Theorem": {
        "definition": "In a right-angled triangle, the square of the hypotenuse (the side opposite the right angle) is equal to the sum of the squares of the other two sides.",
        "formula": "a^2 + b^2 = c^2",
        "pedagogical_steps": [
            "Draw a right triangle and label its sides a, b, c.",
            "Construct squares on each side of the triangle.",
            "Rearrange the squares to visually demonstrate a^2 + b^2 = c^2."
        ]
    }
}
 
# Example Usage:
nlu = NLUModule()
scene_planner = ScenePlanner()
 
user_query = "Explain the Pythagorean theorem with an animation."
concept_info = nlu.process_query(user_query)
 
if concept_info["concept"]:
    storyboard_plan = scene_planner.generate_storyboard(concept_info, knowledge_graph_data)
    print("Generated Storyboard:")
    for scene in storyboard_plan:
        print(f"  Scene ID: {scene['scene_id']}")
        print(f"  Description: {scene['description']}")
        print(f"  Elements: {scene['elements']}")
        print(f"  Animations: {scene['animations']}\n")
else:
    print("Could not understand the query.")

4.3.2. Example: Manim Code Generation with LaTeX

This snippet illustrates how the Manim Script Generation sub-module takes a scene description from the storyboard and translates it into actual Manim Python code, incorporating LaTeX for mathematical expressions. The LLM is prompted to generate Manim classes and methods.

Let's take the intro_triangle scene from the previous example:

# Conceptual representation of the Manim Script Generation sub-module
class ManimScriptGenerator:
    def generate_manim_script(self, scene_plan: dict) -> str:
        scene_id = scene_plan["scene_id"]
        description = scene_plan["description"]
        elements = scene_plan["elements"]
        animations = scene_plan["animations"]
 
        # This would involve an LLM call to generate Manim code
        # based on the scene_plan. The LLM is given Manim API context
        # and examples of LaTeX integration.
        # For demonstration, we'll return a simplified Manim code string.
 
        manim_code = f"""
from manim import *
 
class {scene_id.replace(' ', '_').title()}(Scene):
    def construct(self):
        # {description}
"""
 
        if scene_id == "intro_triangle":
            manim_code += """
        # Draw a right-angled triangle
        triangle = Polygon(
            ORIGIN,
            RIGHT * 4,
            UP * 3,
            color=BLUE
        )
        self.play(Create(triangle))
 
        # Add labels using LaTeX
        label_a = MathTex("a").next_to(triangle.get_edge_center(UP + LEFT), LEFT)
        label_b = MathTex("b").next_to(triangle.get_edge_center(RIGHT + DOWN), DOWN)
        label_c = MathTex("c").next_to(triangle.get_center(), UP + RIGHT)
        self.play(FadeIn(label_a, label_b, label_c))
 
        self.wait(1)
"""
        elif scene_id == "squares_on_sides":
            manim_code += """
        # Construct squares on each side
        square_a = Square(side_length=3).move_to(LEFT * 1.5 + UP * 1.5)
        square_b = Square(side_length=4).move_to(RIGHT * 2 + DOWN * 2)
        square_c = Square(side_length=5).move_to(ORIGIN)
 
        self.play(Create(square_a), Create(square_b), Create(square_c))
        self.wait(1)
"""
        elif scene_id == "rearrange_proof":
            manim_code += """
        # This scene would involve more complex animations to rearrange squares
        # and visually demonstrate a^2 + b^2 = c^2.
        # For brevity, we'll just show the formula.
        pythagorean_formula = MathTex("a^2 + b^2 = c^2")
        self.play(Write(pythagorean_formula))
        self.wait(2)
"""
        return manim_code
 
# Example Usage:
manim_generator = ManimScriptGenerator()
 
# Simulate a scene plan from the storyboard
sample_scene_plan = {
    "scene_id": "intro_triangle",
    "description": "Draw a right-angled triangle with sides a, b, and hypotenuse c.",
    "elements": ["right_triangle", "labels_a_b_c"],
    "animations": ["draw_shape", "fade_in_labels"]
}
 
generated_script = manim_generator.generate_manim_script(sample_scene_plan)
print("\nGenerated Manim Script for intro_triangle:")
print(generated_script)
 
# Example for a different scene
sample_scene_plan_2 = {
    "scene_id": "rearrange_proof",
    "description": "Rearrange the squares to show area(a^2) + area(b^2) = area(c^2).",
    "elements": ["square_a", "square_b", "square_c"],
    "animations": ["move_and_rotate_squares", "show_area_equality"]
}
 
generated_script_2 = manim_generator.generate_manim_script(sample_scene_plan_2)
print("\nGenerated Manim Script for rearrange_proof:")
print(generated_script_2)

4.3.3. Example: Integrating Visual Feedback for Refinement

This conceptual example demonstrates the Visual Feedback and Refinement Loop. After a Manim script is generated and executed, its visual output is analyzed. If discrepancies are found, feedback is generated and used to refine the Manim script. This loop is crucial for ensuring visual correctness and pedagogical effectiveness.

# Conceptual representation of the Visual Feedback and Refinement Loop
class VisualFeedbackLoop:
    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts
 
    def refine_manim_script(self, manim_script: str, expected_visual_features: dict) -> str:
        current_script = manim_script
        for attempt in range(self.max_attempts):
            print(f"\nRefinement Attempt {attempt + 1}:")
            # Simulate Manim execution and frame capture
            actual_visual_features = self._simulate_manim_execution(current_script)
 
            # Compare actual vs. expected visual features
            discrepancies = self._compare_visual_features(actual_visual_features, expected_visual_features)
 
            if not discrepancies:
                print("Visual output matches expected features. No refinement needed.")
                return current_script
            else:
                print(f"Discrepancies found: {discrepancies}")
                # LLM-based error diagnosis and feedback generation
                feedback = self._llm_diagnose_and_feedback(current_script, discrepancies)
                print(f"Generated Feedback: {feedback}")
 
                # LLM attempts to revise the Manim script based on feedback
                current_script = self._llm_revise_manim_script(current_script, feedback)
                print("Revised Manim Script generated.")
 
        print("Max refinement attempts reached. Could not achieve desired visual output.")
        return current_script
 
    def _simulate_manim_execution(self, script: str) -> dict:
        # In a real system, this would execute Manim and capture frames.
        # For simulation, we'll return a simplified representation of visual features.
        if "label_a = MathTex(\"a\").next_to(triangle.get_edge_center(UP + LEFT), LEFT)" in script:
            # Simulate a common error: label 'a' is slightly off-screen
            return {"triangle_present": True, "labels_present": True, "label_a_position": "off_screen_left"}
        return {"triangle_present": True, "labels_present": True, "label_a_position": "correct"}
 
    def _compare_visual_features(self, actual: dict, expected: dict) -> list:
        discrepancies = []
        if actual.get("label_a_position") != expected.get("label_a_position"):
            discrepancies.append("label_a_position_mismatch")
        # More complex comparisons would be done here (e.g., object overlap, animation timing)
        return discrepancies
 
    def _llm_diagnose_and_feedback(self, script: str, discrepancies: list) -> str:
        # LLM analyzes script and discrepancies to provide specific feedback.
        # Example: if 'label_a_position_mismatch' is found and 'UP + LEFT' is in script,
        # suggest adjusting the positioning.
        if "label_a_position_mismatch" in discrepancies and "UP + LEFT" in script:
            return "Label 'a' is off-screen. Adjust its positioning, perhaps use a smaller offset or a different relative position."
        return "General visual discrepancy. Review Manim code for visual correctness."
 
    def _llm_revise_manim_script(self, original_script: str, feedback: str) -> str:
        # LLM revises the script based on the feedback.
        # This is a simplified example; a real LLM would perform more intelligent revision.
        if "Label 'a' is off-screen" in feedback:
            # Example fix: nudge only the label's relative offset, leaving the edge direction intact
            return original_script.replace(
                "get_edge_center(UP + LEFT), LEFT)",
                "get_edge_center(UP + LEFT), LEFT * 0.5)"
            )
        return original_script
 
# Example Usage:
feedback_loop = VisualFeedbackLoop()
manim_generator = ManimScriptGenerator() # Re-using the generator from previous example
 
# Simulate an initial script (with a potential error for demonstration)
initial_scene_plan = {
    "scene_id": "intro_triangle",
    "description": "Draw a right-angled triangle with sides a, b, and hypotenuse c.",
    "elements": ["right_triangle", "labels_a_b_c"],
    "animations": ["draw_shape", "fade_in_labels"]
}
initial_script = manim_generator.generate_manim_script(initial_scene_plan)
 
# Define expected visual features
expected_features = {"triangle_present": True, "labels_present": True, "label_a_position": "correct"}
 
final_script = feedback_loop.refine_manim_script(initial_script, expected_features)
print("\nFinal Manim Script after refinement:")
print(final_script)

These conceptual snippets demonstrate the modularity and interaction within the ESCG. The actual implementation would involve robust API calls to LLMs, a more sophisticated Manim execution environment, and advanced visual analysis techniques. The iterative refinement loop, driven by visual feedback, is a cornerstone of our approach, enabling the system to overcome the inherent limitations of LLMs in visual reasoning and produce high-quality, pedagogically effective STEM visualizations.

5. Evaluation and Discussion

Evaluating the effectiveness of a system like the Explainable STEM Concept Generator (ESCG) presents unique challenges due to its multimodal output and pedagogical objectives. Traditional metrics for code generation or natural language processing alone are insufficient. Our evaluation framework focuses on assessing both the technical performance of the system and its pedagogical impact.

5.1. Evaluation Metrics

We propose a multi-faceted evaluation approach, combining automated metrics with human expert assessment and user studies:

  • Animation Accuracy and Visual Fidelity:

    • Manim Code Correctness: Automated checks for syntax errors, runtime errors, and adherence to Manim best practices. This can be measured by the success rate of Manim script execution.
    • Visual Fidelity (Automated): Using the Visual Feedback and Refinement Loop, we can quantify the reduction in discrepancies between generated and expected visual features over refinement iterations. Metrics could include image similarity (e.g., SSIM, PSNR) between rendered frames and target visual representations, or object detection accuracy for key visual elements. A short SSIM/PSNR sketch follows this list.
    • Visual Fidelity (Human Expert): Subjective assessment by Manim experts on the aesthetic quality, clarity, and visual correctness of the animations. This includes evaluating factors like object placement, animation smoothness, and adherence to visual conventions.
  • Explanation Clarity and Accuracy:

    • Factual Correctness: Automated verification against the Knowledge Graph and external authoritative sources to ensure the accuracy of scientific and mathematical statements in the natural language explanation.
    • Clarity and Coherence: Human expert assessment of the linguistic quality, logical flow, and ease of understanding of the explanations. Metrics like ROUGE or BLEU scores can provide a preliminary, though limited, indication of coherence when compared to reference explanations.
    • Pedagogical Soundness: Evaluation by STEM educators on whether the explanation effectively breaks down complex concepts, addresses potential misconceptions, and aligns with established teaching methodologies.
  • Multimodal Synchronization and Cohesion:

    • Temporal Alignment: Automated analysis to ensure that the natural language narration is precisely synchronized with the corresponding visual events in the Manim animation. This can be measured by the average delay or lead time between audio cues and visual changes.
    • Conceptual Alignment: Human expert assessment of how well the visual and textual modalities complement each other in conveying the STEM concept. Does the animation truly illustrate what the narration describes, and vice-versa?
  • User Comprehension and Engagement (User Studies):

    • Pre/Post-Test Scores: Administering knowledge assessments to target learners before and after interacting with ESCG-generated content to measure learning gains.
    • Engagement Metrics: Tracking user interaction data such as watch time, replay rates, and feedback scores to gauge engagement levels.
    • Qualitative Feedback: Collecting open-ended feedback from students and educators through surveys and interviews to understand their perceptions of the system's usability, effectiveness, and areas for improvement.
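
For the automated visual-fidelity metrics above, frame-level SSIM and PSNR could be computed with scikit-image; the sketch below assumes rendered and reference frames have already been captured as same-sized RGB arrays, and the frame contents are placeholders.

import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def frame_fidelity(rendered: np.ndarray, reference: np.ndarray) -> dict:
    """Compare a rendered Manim frame against a reference frame (RGB uint8 arrays)."""
    ssim = structural_similarity(rendered, reference, channel_axis=-1, data_range=255)
    psnr = peak_signal_noise_ratio(reference, rendered, data_range=255)
    return {"ssim": ssim, "psnr": psnr}

# Placeholder frames; in practice these would come from the ESCG output and
# the manually crafted reference animations.
rendered_frame = np.random.randint(0, 256, (720, 1280, 3), dtype=np.uint8)
reference_frame = np.clip(rendered_frame.astype(np.int16) + 3, 0, 255).astype(np.uint8)
print(frame_fidelity(rendered_frame, reference_frame))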

5.2. Experimental Setup and Dataset

To rigorously evaluate the ESCG, we would establish an experimental setup involving a diverse set of STEM concepts, ranging from fundamental principles to advanced topics. The dataset for evaluation would comprise:

  • STEM Concept Prompts: A collection of natural language queries covering various STEM disciplines (e.g., physics, mathematics, chemistry, computer science) and different levels of complexity.
  • Reference Manim Scripts: For a subset of concepts, manually crafted, high-quality Manim scripts serving as ground truth for visual fidelity and code correctness.
  • Reference Explanations: Expert-written natural language explanations for the chosen concepts, used as benchmarks for clarity and factual accuracy.
  • Student Cohorts: For user studies, a diverse group of students with varying prior knowledge and learning styles to assess pedagogical impact.

The experiments would involve generating animations and explanations for the test concepts using the ESCG, followed by a systematic evaluation using the metrics outlined above. Comparisons would be made against baseline methods, such as purely text-based LLM explanations or manually created Manim animations.

5.3. Results and Analysis

We anticipate that the ESCG, particularly with its Visual Feedback and Refinement Loop, will demonstrate significant improvements over existing Manim-LLM integration approaches. Specifically:

  • Reduced Manim Code Errors: The iterative refinement loop, guided by visual feedback, is expected to substantially decrease the number of syntax and logical errors in generated Manim scripts, leading to a higher success rate of animation rendering.
  • Enhanced Visual Quality: The JEPA-inspired feedback mechanism should enable the system to produce animations with greater visual fidelity, better spatial arrangement of elements, and more pedagogically effective visual sequences, addressing the common issue of LLMs lacking visual intuition.
  • Improved Multimodal Cohesion: By synchronizing natural language explanations with visually accurate animations, the ESCG is expected to create a more cohesive and intuitive learning experience, leading to better user comprehension and engagement.
  • Generalizability Across STEM: The modular design and reliance on a comprehensive Knowledge Graph should allow the ESCG to adapt to a wide range of STEM concepts, demonstrating its potential for broad application in educational settings.

However, challenges are also anticipated. The computational cost of the iterative visual feedback loop might be considerable, requiring optimization. Furthermore, while the system aims for generalizability, fine-tuning or specialized knowledge might still be necessary for highly niche or advanced STEM topics. The subjective nature of pedagogical effectiveness will also necessitate extensive user studies to validate the system's true impact on learning outcomes.

5.4. Case Studies

To illustrate the capabilities of the ESCG, we would present detailed case studies. For example:

  • Case Study 1: Explaining the Concept of a Derivative in Calculus. This would showcase how the ESCG animates the secant line approaching the tangent, visually demonstrating the limit definition of a derivative, while simultaneously providing a clear, step-by-step natural language explanation. The system's ability to handle LaTeX for mathematical expressions like $\frac{dy}{dx} = \lim_{\Delta x \to 0} \frac{f(x+\Delta x) - f(x)}{\Delta x}$ would be highlighted.
  • Case Study 2: Visualizing Quantum Superposition. This case study would demonstrate the ESCG's capacity to animate abstract physics concepts, such as a particle existing in multiple states simultaneously, and explain the underlying quantum mechanical principles in an accessible manner. The use of visual metaphors and precise timing would be crucial here.

These case studies would provide concrete examples of the ESCG in action, demonstrating its ability to transform complex STEM concepts into understandable and engaging multimodal content.

6. Conclusion and Future Work

6.1. Summary of Contributions

In this paper, we introduced the Explainable STEM Concept Generator (ESCG), a novel architectural framework designed to integrate Large Language Models (LLMs) with Manim for generating dynamic visualizations and natural language explanations of STEM concepts. Our primary contributions include:

  • A Modular and Agentic Architecture: We proposed a comprehensive system comprising distinct modules for Natural Language Understanding, Knowledge Graph reasoning, Manim Code Generation, Natural Language Explanation Generation, and a crucial Visual Feedback and Refinement Loop.
  • Novel Visual Feedback Loop: Inspired by JEPA principles, this loop addresses the inherent lack of visual understanding in LLMs by providing an automated mechanism for evaluating and refining Manim animations based on their visual output, leading to higher fidelity and pedagogical effectiveness.
  • Seamless LaTeX Integration: Our architecture emphasizes the generation of Manim code that effectively incorporates LaTeX for precise and visually appealing mathematical expressions, a critical feature for STEM education.
  • Focus on Multimodal Cohesion: The ESCG is designed to produce synchronized visual and textual explanations, creating a more engaging and intuitive learning experience that caters to diverse learning styles.
  • Addressing Key Limitations: The proposed system directly tackles the challenges of prompt dependency, error proneness in LLM-generated code, and the need for greater generalizability across STEM domains.

6.2. Implications for STEM Education

The ESCG holds significant implications for the future of STEM education. By automating the creation of high-quality, explainable, and dynamic educational content, it can:

  • Democratize Access to Visual Learning: Make complex STEM concepts more accessible to a broader audience, including those without strong mathematical or scientific backgrounds, by providing intuitive visual explanations.
  • Enhance Learning Outcomes: Improve student comprehension, retention, and engagement by offering multimodal learning experiences that cater to different learning styles.
  • Empower Educators: Provide educators with a powerful tool to rapidly generate customized teaching materials, freeing up their time to focus on personalized instruction and higher-order thinking skills.
  • Foster Critical Thinking: By offering transparent explanations and visual demonstrations, the system encourages students to delve deeper into the 'why' behind concepts, rather than just the 'what'.

6.3. Future Research Directions

While the ESCG represents a significant step forward, several avenues for future research and development exist:

  • Real-time Interaction and Personalization: Exploring the development of real-time interactive capabilities, allowing students to dynamically modify parameters, ask follow-up questions, and receive immediate visual and textual feedback. This could lead to highly personalized learning paths.
  • Broader STEM Coverage and Interdisciplinary Concepts: Expanding the Knowledge Graph and refining the LLM components to cover an even wider array of STEM disciplines, including interdisciplinary concepts that span multiple fields.
  • Advanced Visual Feedback Mechanisms: Investigating more sophisticated visual feedback techniques, potentially incorporating user eye-tracking data or neuro-scientific insights to further optimize the pedagogical effectiveness of animations.
  • Automated Misconception Detection and Correction: Developing modules that can identify common student misconceptions and automatically generate targeted animations and explanations to address them.
  • Integration with Virtual and Augmented Reality (VR/AR): Exploring how ESCG-generated content can be rendered in immersive VR/AR environments, offering even more engaging and interactive learning experiences.
  • Ethical Considerations and Bias Mitigation: Continuously researching and implementing strategies to mitigate potential biases in LLM-generated content and ensure equitable access and pedagogical fairness.

By continuing to refine and expand the capabilities of systems like the ESCG, we can unlock the full potential of AI to transform STEM education, making complex knowledge more accessible, engaging, and ultimately, more explainable for learners worldwide.

7. References

[1] Gunning, D., et al. (2019). XAI—Explainable Artificial Intelligence. Science Robotics, 4(37), eaay7120.

[2] Tversky, B. (2011). Visualizing Thought. Topics in Cognitive Science, 3(3), 499-535.

[3] Manim Community Developers. (2024). Manim: Mathematical Animation Engine. Available at: https://www.manim.community/

[4] Chen, X., et al. (2023). Large Language Models in Education: A Survey. arXiv preprint arXiv:2312.10026.

[5] HarleyCoops. (2025). Math-To-Manim. GitHub repository. Available at: https://github.com/HarleyCoops/Math-To-Manim

[6] makefinks. (2025). manim-generator. GitHub repository. Available at: https://github.com/makefinks/manim-generator

[7] Bali, K., et al. (2023). Multimodal Large Language Models: A Survey. arXiv preprint arXiv:2311.09279.

[8] Ku, M., et al. (2025). TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding. arXiv preprint arXiv:2502.19400.

[9] HarleyCoops. (2025). Math-To-Manim. GitHub repository. Available at: https://github.com/HarleyCoops/Math-To-Manim

[10] Varoni Barro, L. F. (2025). EduWiz: Automating Manim Visualizations. AI Tinkerers Toronto. Available at: https://toronto.aitinkerers.org/talks/rsvp_fXSG2as-3RI

[11] makefinks. (2025). manim-generator. GitHub repository. Available at: https://github.com/makefinks/manim-generator

[12] LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. Keynote at ICLR 2022. Available at: https://www.youtube.com/watch?v=0QeB7e0f2y8