Piniverse Content Creation Engine

This undergraduate research project investigated natural language generation (NLG) through the design, implementation, and evaluation of a machine learning–based text generation system. The work focused on producing contextually coherent, stylistically fluent, and semantically relevant outputs, while examining the trade-offs between model complexity, training data, and controllability.

Position: Programmer

Conducted as a Major Qualifying Project (MQP) at Worcester Polytechnic Institute, this study systematically explored the challenges and opportunities of natural language generation. Grounded in contemporary advances in neural language modeling, the project sought a framework that balanced technical rigor with practical application. The workflow began with the construction of a representative training corpus that emphasized text diversity and linguistic coverage. Preprocessing included tokenization, normalization, and sequence alignment to support efficient training across multiple model variants. Baseline experiments used probabilistic and recurrent architectures; transformer-based models were then implemented to capture long-range dependencies and improve semantic cohesion.
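To make the preprocessing and baseline steps concrete, the following is a minimal sketch of normalization, tokenization, and a count-based bigram model, which is one simple kind of probabilistic baseline. It is not the project's actual code: the function names, the NFKC and lowercasing choices, the regex tokenizer, and the `<s>`/`</s>` boundary markers are all assumptions made for illustration.

```python
import re
import unicodedata
from collections import Counter, defaultdict


def normalize(text):
    """Apply Unicode NFKC normalization, lowercase, and collapse whitespace.

    These particular normalization choices are illustrative assumptions.
    """
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split normalized text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", normalize(text))


def train_bigram(sentences):
    """Count bigrams over sentences padded with hypothetical <s> and </s> markers."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + tokenize(sentence) + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def most_likely_next(counts, token):
    """Return the most frequent follower of `token`, or None if it was never seen."""
    followers = counts.get(token)
    return followers.most_common(1)[0][0] if followers else None


if __name__ == "__main__":
    model = train_bigram(["The cat sat.", "The cat ran.", "The dog sat."])
    print(tokenize("Hello,   World!"))
    print(most_likely_next(model, "the"))
```

A count-based model like this only sees one token of context, which is the limitation that recurrent and transformer models are meant to address.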

The evaluation phase adopted a dual methodology, combining quantitative and qualitative assessments. Intrinsic metrics such as perplexity, BLEU scores, and type-token ratios were used to benchmark linguistic quality, while human-centered evaluation provided insights into narrative coherence, stylistic fidelity, and user engagement. Special attention was given to the problem of controllability: mechanisms were designed to steer model outputs based on prompts, stylistic constraints, or semantic conditions, thereby expanding the applicability of the system to domains such as guided storytelling, summarization, and adaptive dialogue. The study also engaged with ethical and practical concerns, including the propagation of bias, semantic hallucination, and repetition artifacts, and investigated mitigation strategies through data curation and output filtering.
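Two of the intrinsic metrics named above, along with a simple repetition check of the kind an output filter might use, can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the study's evaluation code: the function names are hypothetical, perplexity is computed from per-token probabilities that a model is assumed to have already produced, and the trigram default for the repetition check is an arbitrary choice.

```python
import math
from collections import Counter


def perplexity(token_probs):
    """Perplexity as exp of the mean negative log-probability per token.

    `token_probs` holds the probability the model assigned to each
    reference token (assumed to be precomputed and strictly positive).
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_neg_log)


def type_token_ratio(tokens):
    """Ratio of distinct tokens to total tokens; 0.0 for empty input."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def has_repeated_ngram(tokens, n=3):
    """Return True if any n-gram occurs more than once in `tokens`."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return any(count > 1 for count in ngrams.values())


if __name__ == "__main__":
    sample = "the cat sat on the mat and the cat sat down".split()
    print(type_token_ratio(sample))
    print(has_repeated_ngram(sample, n=3))
    print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Lower perplexity indicates that the model found the reference text less surprising, while a higher type-token ratio indicates more varied vocabulary. Neither metric measures coherence directly, which is why the evaluation pairs them with human judgment.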