Due to the impressive reasoning, memory, and comprehension abilities inherent in large language models (OpenAI, 2022), substantial progress and prospects have arisen in various domains. Particularly in fields like finance, medicine, and law, customized large models tailored to specific verticals have emerged, efficiently tackling issues commonly associated with general-purpose large models, such as vague responses and hallucinations caused by uniform training data distribution, thereby boosting staff productivity.
Through discussions with planners from city planning departments/companies, it became evident that significant amounts of time are expended on tasks such as planning text management, review, audit, and assessment. For instance, during text review, staff meticulously evaluate each item against a standard framework, rectifying errors or omissions in urban planning documents. Similarly, in text assessment, staff evaluate documents from multiple dimensions (legality, feasibility, economic viability, innovativeness), which consumes considerable time and effort. Leveraging the robust comprehension and reasoning abilities of LLMs, we posit that the aforementioned processes can be addressed through the incorporation of large language models, as shown in Figure 1.
Figure 1: Review task workflow
However, in practical operations, we have found that it is not an easy task due to the inherent nature of the Chinese urban planning industry and the characteristics of urban planning texts:
To address the distinctive challenges inherent in urban planning texts, we present the first Large Language Model in the urban planning domain: PlanGPT. Firstly, it features a customized embedding model and vector database retrieval system for accurate information extraction in vast amounts of urban planning texts, overcoming the low signal-to-noise ratio characteristic of the urban planning domain by using keyword extraction and hierarchical search techniques. Additionally, we employ instruction fine-tuning methods to activate the model’s interdisciplinary knowledge and enhance its proficiency in mastering the style of governmental documents, meeting the demands of planners. Furthermore, inspired by advancements in agent-based systems within the realm of large models, PlanAgent has been created to strategically utilize resources like networks, visual aids, charts, or domain-specific models. This approach significantly tackles the issues related to timeliness and multimodality in planning documents.
Experimental results have demonstrated that PlanGPT effectively addresses all the aforementioned challenges, fulfilling the needs of planners in the four typical tasks of daily work, surpassing other state-of-the-art models.
1 The models are suggested to be open-source and CSPON-compliant (CSPON: China Spatial Planning Online Monitoring Network).
Large Language Models
Large language models (LLMs) encompass both general-purpose and vertical-specific applications, showcasing their versatility and effectiveness. Notable models like ChatGPT(OpenAI, 2022), GPT4(OpenAI, 2023), the LLaMA series(Touvron et al., 2023), Bard(DeepMind, 2023a), PaLM2(et al., 2023b), Claude2(Anthropic, 2023), Mistral(Mistral-AI, 2023) and Gemini(DeepMind, 2023b) demonstrate broad capabilities across various tasks and industries. In the Chinese language domain, models like the Baichuan series, GLM series(Du et al., 2022), Kimi-chat, Yi, Qwen, Skywork(Wei et al., 2023) and LLaMA-Chinese(Cui et al., 2023b) offer several advantages tailored to the Chinese language and its unique challenges. Vertical-specific applications also benefit from LLMs. Examples include HuaTuo(Wang et al., 2023), a medical domain model, and ChatLaw(Cui et al., 2023a), an open-source legal LLM, which address specific needs within their respective domains. Similarly, XuanYuan 2.0(Zhang et al., 2023b) caters to the finance sector, DoctorGLM(Xiong et al., 2023) focuses on healthcare, and MathGPT(Tycho Young, 2023) enhances mathematical problem-solving capabilities. These models collectively highlight the diverse applications and potential of LLMs across different domains.
Domain
In the fields relevant to urban planning such as geography and transportation, several specialized models have emerged. TrafficGPT(Zhang et al., 2023a) integrates ChatGPT with traffic foundation models to enhance urban traffic management and decision support through data analysis and natural language dialogues. Prithvi(et al., 2023a), a NASA-derived model, focuses on climate, disaster, and geography predictions, pre-trained on IBM’s watsonx.ai, serving applications like climate change, flood mapping, and crop yield forecasting. TransGPT(Peng, 2023), as China’s first open-source traffic model, finds applications in traffic prediction, advisory, public transport services, urban planning, safety education, accident analysis, and autonomous driving support. EarthGPT(Zhang et al., 2024), a multi-modal large language model (MLLM) designed for remote sensing (RS) images, integrates RS interpretation tasks to enhance both visual perception and language understanding. Currently, there is no large model specifically tailored for the urban and spatial planning domain, so we humbly introduce PlanGPT to address this gap.
In vertical domains, the faithfulness and factuality of large model outputs are heavily relied upon. Retrieval techniques, fine-tuning methods, and agent tools have been proven to effectively mitigate model hallucination issues. RAG combines parameterized knowledge from LLMs with non-parameterized external knowledge to alleviate hallucination problems. Outstanding retrieval works such as Raven(Huang et al., 2023a), Retro(Borgeaud et al., 2022), Toc Sugre(Kim et al., 2023), selfmem(Cheng et al., 2024), genread(Yu et al., 2022), and RECITE(Sun et al., 2022) contribute significantly in this regard. Notably, the SelfRAG(Asai et al., 2023) framework introduces a retrieval token to determine whether to recall documents, followed by assessing document validity using a critique token. FLARE(Jiang et al., 2023) iteratively executes retrieval, judging the need for answer regeneration based on probability calculations. RA-DIT(Lin et al., 2023) enhances the LM’s use of retrieved information and refines the retriever for more relevant results, yielding significant performance gains when combined. Instruction fine-tuning significantly enhances model capabilities and effectively alleviates hallucinations. By employing methods like humpback(Li et al., 2023c), kun(Zheng et al., 2024), and muffin(Lou et al., 2023), we collect data from various sources, ensuring quality through filtering methods like deita(Liu et al., 2024), cherry(Li et al., 2023b), mods(Du et al., 2023), etc. Additionally, techniques such as wizardlm(Xu et al., 2023) and selfinstruct(Wang et al., 2022) increase data difficulty, improving model robustness. Agents can determine the appropriate tools to use, such as web searches (WebGLM(Liu et al., 2023), WebGPT(Nakano et al., 2021)) or function calls, to enhance the quality of model outputs.
Inspired by these works, we propose retrieval and instruction labeling methods tailored for the urban planning domain, in conjunction with PlanAgent, effectively mitigating hallucination issues in large models.
Figure 2: PlanGPT Architecture
In this section, we will introduce the overarching framework and technical intricacies of PlanGPT.
In urban planning, professionals often struggle to find relevant materials from large datasets. This task can be modeled as the identification of the most pertinent document span $s^*$ within a collection $S$, defined as $s^* = \arg\max_{s \in S} \text{Relate}(q, s)$, where $\text{Relate}(q, s)$ represents the similarity function between inquiry $q$ and document span $s$.
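As a minimal illustration of this retrieval objective, the Python sketch below selects the span that maximizes Relate(q, s). It assumes, purely for illustration, that the query and spans are already embedded as vectors and that Relate is cosine similarity; the function names and toy vectors are hypothetical and do not describe the actual PlanGPT implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_relevant_span(query_vec, span_vecs, relate=cosine):
    """Return the index of the span maximizing relate(query, span)."""
    if not span_vecs:
        raise ValueError("the span collection S must not be empty")
    return max(range(len(span_vecs)), key=lambda i: relate(query_vec, span_vecs[i]))


# Toy 2-d embeddings standing in for document spans (assumed values).
spans = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
best = most_relevant_span([0.5, 0.9], spans)
```

Any other similarity function can be passed through the `relate` argument, which is where a domain-specific embedding model would plug in.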
Advanced embedding methods are considered common solutions in enhancing semantic understanding, but they still produce suboptimal results in the field of urban planning due to two reasons: (1) Specialized Terminology: Urban planning possesses its own linguistic system, characterized by abbreviations and substitutions for specialized terms. For example, regulations may refer to zoning regulations, land type to land use classification, causing ambiguity, especially in Chinese. (2) Planner’s Perspective on Vocabulary: Common terms like land use carry richer meanings for planners. While commonly understood as land utilization, planners view it as interactions between people, land, and ecosystems. This difference in perspective affects semantic understanding and search accuracy.
Drawing inspiration from previous work involving embedding models(Cui et al., 2022, 2021; Mikolov et al., 2013; Chen et al., 2024; Reimers and Gurevych, 2019; Gao et al., 2022; Su, 2022), we introduce our embedding model Plan-Emb for the urban planning domain. Plan-Emb is an embedding model tailored for comprehending urban-planning-specific knowledge with a two-stage training process: initial pre-training using general Chinese text labels(Bowman et al., 2015), followed by supervised fine-tuning on self-collected urban planning datasets. A regularization InfoNCE loss(Oord et al., 2018) is introduced during the second stage to prevent catastrophic forgetting of prior model capabilities.
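For readers unfamiliar with InfoNCE, the sketch below computes the standard contrastive loss for one anchor from precomputed similarity scores. It is an illustrative assumption only: the exact regularized variant, the temperature, and the way negatives are sampled for Plan-Emb are not specified in this section, so the default values here are placeholders.

```python
import math


def info_nce(sim_pos, sim_negs, temperature=0.05):
    """Standard InfoNCE loss for one anchor.

    sim_pos  : similarity between the anchor and its positive sample
    sim_negs : similarities between the anchor and each negative sample
    Returns -log( exp(sim_pos/t) / sum_j exp(sim_j/t) ), where the sum runs
    over the positive and all negatives.
    """
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# The loss shrinks as the positive pair becomes more similar than the negatives.
loss_good = info_nce(0.9, [0.1, 0.0])
loss_bad = info_nce(0.2, [0.1, 0.0])
```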
For fine-tuning data collection, we initially leverage LLMs to filter keywords or key sentences aligned with our self-curated teaching syllabus. Subsequently, a cost-effective approach involving perturbations, explanations, and rewriting is employed to generate positive samples. Subsequent experiments confirm the effectiveness of Plan-Emb.
To address the challenges of low signal-to-noise ratio and declining embedding capability with longer sentences, we introduce a novel hierarchical embedding approach for query processing (depicted in Algorithm 1). In the data pre-processing phase, a tailored keyword extraction method (PlanKeyBert) is employed to extract relevant keywords $k_i$ from each chunk $d_i$ of the input document $D$ and store them in a hash-map, mapping each chunk $d_i$ to its corresponding $k_i$ while retaining essential information. During the search process, a query $Q$ is used to recall relevant documents from the vector database based on keyword and semantic similarity scores. Subsequently, hard matching scores and advanced cross-attention scores are employed to rerank the recall results.
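The recall-then-rerank flow described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' Algorithm 1: the keyword extractor, semantic scorer, and reranker are passed in as plain callables, and the toy stand-ins (`toy_keywords`, `toy_semantic`, `toy_hard_match`, `VOCAB`) are hypothetical placeholders for PlanKeyBert, the embedding model, and the hard-matching/cross-attention scorers.

```python
from collections import defaultdict


def build_keyword_index(chunks, extract_keywords):
    """Pre-processing: hash-map from each keyword to the ids of chunks containing it."""
    index = defaultdict(set)
    for chunk_id, text in enumerate(chunks):
        for keyword in extract_keywords(text):
            index[keyword].add(chunk_id)
    return index


def hierarchical_search(query, chunks, index, extract_keywords,
                        semantic_score, rerank_score, top_k=3):
    """Recall by keyword, narrow by semantic similarity, then rerank."""
    # Stage 1: keyword recall through the hash-map.
    candidates = set()
    for keyword in extract_keywords(query):
        candidates |= index.get(keyword, set())
    if not candidates:
        # No keyword hit: fall back to scoring every chunk semantically.
        candidates = set(range(len(chunks)))
    # Stage 2: keep the semantically closest candidates.
    recalled = sorted(candidates,
                      key=lambda i: semantic_score(query, chunks[i]),
                      reverse=True)[: top_k * 2]
    # Stage 3: rerank the shortlist with a (typically costlier) scorer.
    return sorted(recalled,
                  key=lambda i: rerank_score(query, chunks[i]),
                  reverse=True)[:top_k]


# Toy stand-ins (assumptions for illustration only).
VOCAB = {"zoning", "land", "transit", "flood", "heritage"}


def toy_keywords(text):
    return {w for w in text.lower().split() if w in VOCAB}


def toy_semantic(query, chunk):
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q | c)


def toy_hard_match(query, chunk):
    return len(toy_keywords(query) & toy_keywords(chunk))


chunks = [
    "zoning rules for land parcels near transit",
    "flood risk maps for the river district",
    "heritage protection of the old town",
    "land supply statistics",
]
index = build_keyword_index(chunks, toy_keywords)
results = hierarchical_search("land zoning near transit", chunks, index,
                              toy_keywords, toy_semantic, toy_hard_match, top_k=2)
```

Filtering by keyword first keeps noisy chunks out of the semantic stage, which is the motivation the text gives for handling a low signal-to-noise ratio.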
Large language models often struggle to integrate domain-specific knowledge, such as in urban planning, leading to language generation that deviates from established conventions. The challenge here lies not solely in the absence of domain-specific data², but rather in the model’s incapacity to synthesize and apply knowledge within this specialized domain.
To address these challenges, we conducted a two-stage model adaptation: Urban planning Knowledge Activation and Specific Capability Development.
Motivated by the Humpback(Li et al., 2023c) method, we propose a self-annotation technique tailored to urban planning, henceforth referred to as Urban planning-annotation, as illustrated in Figure 3.
Figure 3: Urban planning-annotation
The method unfolds as follows:
[…] leveraging sparse annotations. Taking cues from methodologies like LIMA(Zhou et al., 2024) and MoDS(Du et al., 2023), the k-center(Sener and Savarese, 2017) algorithm is employed to bolster diversity in the generated instructions. We refer to the fine-grained data obtained through these five steps as core data and utilize it to fine-tune base models, thus activating knowledge relevant to urban planning.
2 It has been observed that a significant portion of pretrained data in general large-scale models already encompasses domain-specific data related to urban planning.
Engagement with urban planning departments and institutes reveals that large models can aid planners in generating sections for proposals, transferring styles, evaluating proposals and extracting information, but base models’ limited instruction-following capabilities mean prompt learning alone is insufficient to address these tasks effectively. To address practical needs in the field, we further collected over 4,000 historical versions of official plans from provinces, cities, districts, and counties nationwide for targeted capability development. We selected segments with potential utility from them and constructed self-annotated pipelines for four tasks. For example, in text style transfer, we prompt the model to simplify or colloquialize corresponding segments, then have the model rewrite them to match the desired style, generating instruction pairs ⟨raw text, response⟩. We then employed prompt learning with varying temperatures or different models to generate responses of different quality, implementing automatic annotation to score the levels for fine-tuning the scoring model.
In the field of urban planning, professionals are required to have a solid grasp of domain-specific knowledge while also being proficient in utilizing tools relevant to the field. Drawing inspiration from previous work involving agents (Team, 2023b; Xie et al., 2023; Team, 2023a; Hong et al., 2023; Nakajima; Significant Gravitas; Wu et al., 2023; Lun et al., 2023), we have designed and developed an agent that aligns closely with the tasks and requirements of urban planning. This agent, named “PlanAgent”, is tailored to the intricacies of urban planning endeavors.
To assist urban planning professionals in executing complex tasks such as text review, audit, or evaluation, PlanAgent autonomously generates and optimizes task lists based on inputs from planners, subsequently executing them in sequence.
Table 1: Statistics of downstream tasks dataset. “#” indicates the number of samples.
PlanAgent proficiently utilizes specialized domain-specific models to execute pivotal tasks integral to urban planning. These tasks include reverse geocoding, knowledge graph construction, and image captioning. Furthermore, PlanAgent integrates advanced tools developed by urban planning researchers for tasks such as spatiotemporal analysis(Liu and Zhang, 2023; Zhang and Ning, 2023), transit-oriented development (TOD) settings(Shao et al., 2020), neighborhood life-circle urban planning(Zhang et al., 2022), integrated land use and transport planning(Shao et al., 2023), urban simulations(Zhang et al., 2020), digital-twin city platforms, and other essential components of smart city initiatives. This holistic approach ensures a scholarly and comprehensive engagement with the intricate challenges inherent in urban planning endeavors.
PlanAgent autonomously consolidates outputs from diverse LLMs (e.g., Vector LLM, Local LLM) and specialized models through advanced techniques. It can employ a customized reward model in DPO (Rafailov et al., 2024) or RLHF (Christiano et al., 2017) to select the optimal answer, while also utilizing a summarization model to enhance findings from multiple sources.
PlanAgent utilizes Web LLM to access real-time planning regulations and updates. Drawing inspiration from WebGLM’s web crawling (Liu et al., 2023), it employs vector queries and URL crawlers to ensure precision. To further enhance search accuracy, we implemented targeted URL crawlers specifically designed to identify information sources related to urban planning.
The overarching architecture of PlanGPT is depicted in Figure 2, encapsulating its multifaceted capabilities.
In this section, we demonstrate the effectiveness of our model through extensive offline experiments.
4.1.1 Training corpora For urban planning knowledge activation, we curated a specialized dataset for urban planning from diverse sources, including study materials, highly rated Q&A threads from urban planning forums, high-quality textbooks in related majors, and official documents published by local governments in recent years. Detailed statistics are provided in Appendix 8.3. Following meticulous selection using Urban-planning-annotation, we curated nearly 50k high-quality instruction pairs from the corpus, incorporating part of general-domain fine-tuning datasets like ShareGPT(Chiang et al., 2023) or Alpaca-52k³(Taori et al., 2023), which were then used to fine-tune the base model, enhancing its urban planning abilities. For the development of specific capabilities, we employ urban planning data and self-annotation as detailed in Section 3.2.2 to generate a dataset for downstream tasks, as illustrated in Table 1. Taking inspiration from LIMA, we have once again shown that even a small amount of fine-tuning data can yield satisfactory results, albeit with some instability⁴.
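The diversity-oriented selection in Urban-planning-annotation relies on the k-center algorithm (Sener and Savarese, 2017). A common greedy approximation is sketched below in Python as an illustration; the 2-d points stand in for instruction embeddings, and the Euclidean distance, the starting index, and the function name are assumptions rather than details given in the paper.

```python
import math


def k_center_greedy(points, k, first=0):
    """Greedy k-center: repeatedly add the point farthest from the selected set.

    points : list of equal-length numeric tuples (e.g. instruction embeddings)
    k      : number of points to select
    first  : index of the initial center (an arbitrary choice)
    Returns the selected indices in the order they were picked.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [first]
    # Distance from every point to its nearest selected center so far.
    min_dist = [math.dist(p, points[first]) for p in points]
    while len(selected) < k:
        farthest = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(farthest)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[farthest]))
    return selected


# Two tight clusters plus one outlier: the selection spreads across all three.
embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.0, 0.1), (0.0, 4.0)]
picked = k_center_greedy(embeddings, k=3)
```

Near-duplicate samples are skipped because their distance to an already selected center is small, which is how this kind of selection promotes diversity.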
Text Generation Large language models offer significant advantages in generating urban planning documentation, including comprehensive land use plans, development proposals, and zoning ordinances. By leveraging these models, urban planning professionals can streamline the process of drafting complex documents, ensuring clarity, coherence, and adherence to legal and regulatory frameworks. To evaluate the quality of the generated content, we created a grading system from 0 to 3, with four levels indicating quality from poor to excellent. Four professional urban planners provided subjective assessments, and their average rating determined the final quality score (Human) of each model, which was then converted to a 100-point scale.
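The conversion from four planners' 0 to 3 grades to a 100-point score can be illustrated as follows. The exact conversion formula is not stated in the text, so this sketch assumes a simple linear rescaling of the mean rating; the function name and sample ratings are hypothetical.

```python
def human_score(ratings, max_level=3):
    """Average the raters' grades (0..max_level) and rescale linearly to 0..100.

    Assumes a linear mapping, which the paper does not spell out.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(not 0 <= r <= max_level for r in ratings):
        raise ValueError("ratings must lie between 0 and max_level")
    return 100.0 * sum(ratings) / (len(ratings) * max_level)


# Grades from four planners for one generated document (made-up values).
score = human_score([3, 2, 2, 3])
```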
3 Chinese and English versions.
4 In practical terms, approximately 10k fine-tuning samples are required to attain greater stability in outcomes.
Text Style Transfer Urban planners commonly employ text style transfer techniques in their workflow. Large language models can assist in transforming brief or informal texts into the specific style of urban planning communication, thereby enhancing the efficiency of urban and rural workers. The evaluation method is similar to that of Text Generation.
Text Information Extraction Large language models can extract key information from various textual sources, including urban planning reports, public comments, and academic studies, to support data-driven decision-making in urban and spatial planning. We self-annotate the top 5 crucial keywords for each test case and calculate accuracy (Acc), which means whether our model can predict the same keywords as we expected within an acceptable range of semantic variation.
Text Evaluation LLMs can aid urban planners in evaluating urban planning proposals by assessing the feasibility, sustainability, and community impact of diverse projects, thereby offering objective evaluations and recommendations. Notably, we simplify the evaluation process by assigning style ratings from 0 to 3 to each paragraph, treating it as a classification task with accuracy (Acc) and F1 scores. Additionally, we utilize the trained model to automatically evaluate two tasks⁵ and report the scores (PlanEval).
We select several baseline models for comparison:
5 Text Generation, Text Style Transfer
7 Yi-6B only completes 10.8% of our tests, with the majority producing responses that do not meet our requirements.
7 We utilized ChatGPT & GPT-4 for annotating the test data, therefore we are not reporting this experiment.
We conduct fine-tuning experiments using four models: ChatGLM, LLaMA-chinese-7b(Cui et al., 2023b), Mistral-chinese-7b(HIT-SCIR, 2024), and Baichuan2-13b. Eventually, we select glm3-base as our pretraining model, recognized as the state-of-the-art Chinese BaseLM with a smaller parameter scale.
Our implementation is built upon the Transformers framework (Wolf et al., 2020) using PyTorch. For experiments involving the Local-LLM introduced in Section 3.2, we employ full-parameter fine-tuning with AdamW (Loshchilov and Hutter, 2019) as the optimizer. The learning rate is initialized at 5e-5 and gradually decreased following a cosine schedule during training. Additionally, we utilize DeepSpeed ZeRO3 (Aminabadi et al., 2022) with offload and FlashAttention2 (Dao, 2023) to optimize memory usage, employing bfloat16 precision, with a total batch size of 64.
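To make the learning-rate schedule concrete, the sketch below computes a cosine decay starting from 5e-5. It is an assumption-laden illustration: the absence of a warmup phase, the final learning rate of 0, and the total step count are not given in the text and are chosen here only for demonstration.

```python
import math


def cosine_lr(step, total_steps, base_lr=5e-5, min_lr=0.0):
    """Cosine-decayed learning rate at a given training step.

    Starts at base_lr (step 0) and reaches min_lr at total_steps.
    No warmup is modeled; min_lr=0.0 is an assumed value.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rates at the start, midpoint, and end of a 1000-step run.
schedule = [cosine_lr(s, 1000) for s in (0, 500, 1000)]
```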
In experiments related to Plan-Emb, we also utilize AdamW as the optimizer, setting the initial learning rates to 5e-5 for pre-training and 1e-5 for fine-tuning, with a progressive decrease in learning rates as training progresses. To expedite output, we employ vLLM(Kwon et al., 2023) with a temperature (τ) of 0.95 and a top_p value of 0.9. Training these models typically requires about 16 hours on 8 NVIDIA 4090 GPUs.
Table 2: Common Urban Planning Task Evaluation
Evaluation
For the aforementioned tasks, we selected prominent chat models with high rankings on the C-Eval(Huang et al., 2023b) and CMMLU(Li et al., 2023a) leaderboards to conduct experiments under zero-shot or few-shot conditions. The experimental results, along with corresponding evaluation metrics, are documented in Table 2. Across the four tasks, PlanGPT significantly outperformed all other models of similar scale, including proprietary models like ChatGPT, and its outputs align closely with the judgments of urban planners. With an average Spearman correlation coefficient of 79% with human assessment, PlanEval reflects PlanGPT’s effectiveness in evaluating text. However, it still faces challenges in making nuanced distinctions, such as between “best” and “good” quality.
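The agreement between automatic scores and human ratings is measured with Spearman's rank correlation. The dependency-free Python sketch below shows how such a coefficient can be computed, using average ranks for ties; the sample ratings are made up for illustration and are not taken from the paper's experiments.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the group of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical 0-3 quality levels: human ratings vs. automatic scores.
human = [3, 2, 0, 1, 2]
auto = [3, 1, 0, 1, 2]
rho = spearman(human, auto)
```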
Furthermore, we demonstrate the model’s performance during the question-answering process.
To ensure fairness and comprehensiveness, we utilized the urban_and_rural_planner_test in C-Eval(Huang et al., 2023b), referred to as v1, comprising 418 questions. C-Eval is recognized as a reputable Chinese evaluation suite for foundation models, featuring 13,948 multiple-choice questions across 52 diverse disciplines and four difficulty levels.
Additionally, for a broader assessment of model urban planning capabilities, we manually curated approximately 3.5k evaluation questions, including authentic questions from urban and rural planning examinations over the past decade, forming urban_and_rural_planner_test v2. We calculated the score ratio between the two assessments, denoted as δ, in which higher values indicate a more honest assessment of the model’s capabilities. Notably, we strictly followed prompt templates recommended by lm-harness-test(Gao et al., 2023) and C-Eval, selecting options with the highest probabilities. Employing a 0-shot setting, we systematically tested models of comparable sizes listed on the leaderboard and reported their scores, as illustrated in Table 3.
After fine-tuning with the core dataset as introduced in Section 3.2.1, our model achieved state-of-the-art performance among open-source models of similar sizes. It exhibited an approximately 5% increase in accuracy compared to the base model. Furthermore, a δ value close to 0.8 indicates the honesty and domain-generalization capabilities of our model.
Table 3: Urban Planning Knowledge Assessment
To evaluate the performance of Plan-Emb in expressing specialized terminologies and language systems in urban planning, we employed the method described in Section 3.1.1 to generate the urban-rural-STS-B-test (URSTS-B), which consists of two levels: 0, indicating no relation, and 1, signifying a stronger correlation between the word and its explanation. We rigorously evaluated the performance of various phases of Plan-Emb on URSTS-B and other general datasets, employing Spearman’s correlation coefficient (Spearman, 1961) for assessment. As shown in Table 4, with the help of the fine-tuning stage, Plan-Emb captures more urban planning information than any of the general models, which indicates that our embedding strategy exhibits superior aggregational efficacy. Furthermore, it is noteworthy that as training progresses, BERT-cse significantly outperforms BERT-base, underscoring the critical importance of the first-stage pre-training.
A visualization of the t-SNE(Van der Maaten and Hinton, 2008) projection between Plan-Emb and BERT-cse is shown in Figure 4. From the marked examples, we can draw the conclusion that Plan-Emb learns the relations in urban and rural planning much better than BERT-cse in most cases. The terms land utilization (“土地利用”) and benefits (“利益”), along with those representing ancient capital type (“古都型”) and cultural relics (“文物”), which frequently co-occur in urban planning documents, exhibit significantly reduced distances in the t-SNE projection space of Plan-Emb compared to BERT-cse. Additionally, standard residential floor plan layout, construction land planning permit, and planned total area schematic diagram, all indicative of domain knowledge in regional planning, demonstrate enhanced aggregative properties within Plan-Emb.
12+ denotes the model’s performance after the initial pretraining stage using SBERT with a portion of the training data, while BERT-cse reflects the model’s performance after being fully pretrained with CoSENT.
12 STS-B(Cer et al., 2017)
12 PAWSX(Yang et al., 2019)
12 LCQMC(Liu et al., 2018)
Ablation experiments were conducted on VectorLLM to demonstrate the effectiveness of customized modules in enhancing downstream task performance. Following the design of previous experimental settings, we extracted appropriate segments from a large corpus of text to answer questions in urban_and_rural_planner_test, and calculated score@k, representing the accuracy of answered questions within the top k segments. To ensure fairness, network retrieval tools were disabled, and model judgments were based solely on contextual and intrinsic knowledge. We systematically removed Plan-Emb and Plan-HS, documenting the experimental outcomes in Table 5. Our findings indicate that the removal of any task component led to a decline in performance. Specifically, the elimination of each component (Plan-Emb and Plan-HS) resulted in score reductions of 0.7% and 3.6%, respectively. This indirectly highlights the superior expressive capability of Plan-Emb for urban planning texts. Additionally, it’s worth noting that Plan-HS effectively tackled issues related to texts with a low signal-to-noise ratio, significantly enhancing information utilization and accuracy.
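One plausible reading of score@k is sketched below in Python: a question counts as correct if the model answers it correctly when given the top k retrieved segments as context. This interpretation, the function signature, and the toy retriever and answerer are assumptions for illustration; the paper does not give the exact computation.

```python
def score_at_k(examples, retrieve, answer, k):
    """Fraction of questions answered correctly using the top-k retrieved segments.

    examples : list of (question, gold_answer) pairs
    retrieve : callable returning segments ranked from most to least relevant
    answer   : callable mapping (question, context_segments) to a predicted answer
    """
    if not examples:
        raise ValueError("examples must not be empty")
    correct = 0
    for question, gold in examples:
        context = retrieve(question)[:k]
        if answer(question, context) == gold:
            correct += 1
    return correct / len(examples)


# Toy stand-ins: pre-ranked segments and an "answerer" that reads the option
# letter from the first helpful segment it sees (hypothetical, for illustration).
RANKED = {
    "q1": ["answer is A", "noise"],
    "q2": ["noise", "answer is B"],
    "q3": ["noise", "noise"],
}


def toy_retrieve(question):
    return RANKED[question]


def toy_answer(question, context):
    for segment in context:
        if segment.startswith("answer is "):
            return segment[-1]
    return None


examples = [("q1", "A"), ("q2", "B"), ("q3", "C")]
score_at_1 = score_at_k(examples, toy_retrieve, toy_answer, k=1)
score_at_2 = score_at_k(examples, toy_retrieve, toy_answer, k=2)
```

In an ablation, swapping the retriever passed as `retrieve` (for example, one without the customized embedding or hierarchical search) while keeping `answer` fixed isolates the contribution of each retrieval component.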
In this section, we will discuss relevant tasks in the domain of real-world urban planning and provide potential solutions.
Review is the primary task of urban planning institute staff, as extensively discussed in Section 1, and it consumes a significant amount of time. By utilizing VectorLLM to identify the reference standards relevant to a document query and then conducting reviews using PlanAgent, we believe that LLMs can detect inconsistencies, inaccuracies, or discrepancies within the text, ensuring the integrity and quality of urban planning proposals.
However, in practical work, we have found that despite sophisticated prompting, large models often fail to align with human judgment, exhibiting extremes by either flagging minor errors that could be overlooked or excessively relaxing standards, resulting in lower recall rates.
Our solution involves employing GPT-4 to randomly introduce partial errors into urban planning text, along with indicating their locations. Our staff then identify error reasons, categorized into three types: (1) factual errors, (2) spelling/grammar errors, and (3) stylistic errors (including harmful language). Initially, we refine the ability of large-scale models to discern the mere presence of errors. Subsequently, we instruct them to identify and flag errors.
Figure 5: Assessment Task process
In the urban planning domain, text evaluation is a complex task, including verifying the framework of the text, reviewing the details and style of the text (as in the aforementioned review steps), and scoring the overall nature of the document. The overall nature of the document includes novelty, feasibility, and utility.
In actual operations, our solutions are as follows: Novelty: We will use VectorLLM to quickly retrieve and match historical urban plans. Feasibility: PlanAgent integrates network search tools and multimodal capabilities to assess feasibility. Utility: To evaluate the efficacy of the proposed plan, we will develop a simulation environment where multiple PlanAgents will engage in role-playing activities. Through simulated interactions and scenario analyses, the plan’s effectiveness will be assessed across diverse contexts.
In our future endeavors, we aim to explore several key directions to further the advancement of urban and spatial planning:
planning. Our goal is to enrich the knowledge base for both urban and rural planning contexts.
We advocate for a comprehensive overhaul of future urban planning frameworks. By addressing industry concerns and promoting progressive strategies, we envision a gradual yet impactful transformation of future urban planning practices.
In this paper, we introduced PlanGPT, the first large-scale language model framework designed specifically for the field of urban and spatial planning. Through a customized approach, we successfully addressed challenges in urban planning text management, review, and assessment, demonstrating its efficiency and superiority in practice. Our work marks a significant step forward in the convergence of artificial intelligence and urban and rural planning, providing planners with powerful support tools and facilitating more intelligent and efficient decision-making in urban and rural development. In the future, we will continue to refine and expand the capabilities of PlanGPT to further advance its application in the urban planning domain.