Publication

Machine learning and genomics based LCA to optimize crop productivity and environmental sustainability

Ndlovu, Noel
Abstract
The simultaneous selection of crop genotypes that deliver high yield, resilience to (a)biotic stress, and reduced environmental impact represents one of the most pressing challenges in both tropical and temperate agrifood systems. As climate change alters growing environments, natural resources become increasingly constrained, and global food demand continues to rise, crop improvement strategies must evolve beyond the pursuit of individual traits. Instead, there is a critical need for integrated, multi-trait breeding frameworks that can effectively address this complex and dynamic landscape. In response, this study presents a comprehensive, cross-disciplinary framework that unifies genome-wide association studies (GWAS), quantitative trait loci (QTL) mapping, joint linkage association mapping (JLAM), genomic prediction (GP), life cycle assessment (LCA), and machine learning (ML) to quantify and optimize both the agronomic performance and ecological footprint of improved cereal, legume, and forage crops under a range of biotic and abiotic stress conditions. The framework is designed to support data-driven decision-making in breeding programs by linking genetic architecture with environmental sustainability metrics. Three core objectives underpin this approach: (i) to dissect the genomic architecture underlying yield, nutritional quality, agronomic traits, and tolerance to critical stressors in cereals; (ii) to evaluate the environmental impacts of stress-resilient cereal genotypes using a machine learning-enhanced LCA methodology; and (iii) to assess the scalability and cross-crop applicability of the framework, demonstrated through its implementation in legume and forage systems under real-world stress scenarios. The framework was applied to a set of target crops - tropical maize (Zea mays L.), soybean (Glycine max (L.) Merr.), and perennial ryegrass (Lolium perenne L.) - cultivated under diverse environmental stressors, including drought, low soil nitrogen, Striga spp. infestation, and northern corn leaf blight (NCLB) pressure. Multi-environment trials were conducted across contrasting agroecological zones in Kenya, Ireland, Mexico, South Africa, Thailand, Zambia, and Zimbabwe. This integrative approach provides a scalable pathway toward breeding climate-smart crops that align high productivity with enhanced nutritional value and environmental sustainability, thus supporting resilient agrifood systems across varied geographies.
To address the first objective - dissecting the genomic architecture underlying grain yield and related traits under a spectrum of (a)biotic stress conditions - multi-environment field trials and molecular analyses were conducted on over 3,000 tropical maize genotypes. These included evaluations under low soil nitrogen, drought, Striga spp. infestation and NCLB disease pressure. In Kenya and South Africa, a panel of 410 inbred lines and four bi-parental populations was phenotyped under both optimum and nitrogen-limited soil environments. Broad-sense heritability (H²) estimates for key grain quality traits (i.e., protein, starch, and oil content) under low nitrogen stress ranged from 0.18 to 0.86, indicating substantial genetic variation. GWAS identified 42 significant single nucleotide polymorphisms (SNPs) linked to grain quality traits, corresponding to 51 putative candidate genes. Of these, 80.4% (41 genes) had functional annotations, while 19.6% (10 genes) encoded proteins of unknown function. Several annotated genes were associated with nitrogen-responsive metabolic pathways. For instance, GRMZM2G159307 and GRMZM2G104325, encoding ATP-binding proteins with serine/threonine kinase activity, were linked to grain yield and starch content under optimal conditions.
Under nitrogen-limited conditions, genes such as GRMZM2G10816 (yield), GRMZM2G070523, and GRMZM2G080516 (oil content) were involved in DNA biosynthesis, while GRMZM2G033694 - a histone-lysine N-methyltransferase associated with shoot apex development - was responsive across both soil nitrogen regimes. Complementary linkage mapping revealed multiple quantitative trait loci (QTLs) for grain yield and quality traits across nitrogen conditions. Notably, chromosome 1 harboured multi-trait QTLs in regions spanning 209-214 Mb and 268-280 Mb. On chromosome 2, bins 2.03 and 2.06 contained QTL clusters associated with grain yield, starch, and oil content, while chromosome 3 bin 3.06 exhibited co-localization of QTLs for protein, starch, and oil content. Additional trait-linked QTLs were identified on chromosomes 4, 5, 6, and 10. Two GWAS-significant SNPs - S1_269023923 (oil content) and S5_11883140 (grain yield) - co-localized with QTLs qOC_01_269 and qGY_05_15, respectively. Genomic prediction analyses under low nitrogen conditions demonstrated high accuracy for oil content (r = 0.78) and lower accuracy for grain yield (r = 0.08), underscoring the yield trait’s sensitivity to soil nitrogen limitation. The CML550/CML504 test cross yielded the highest prediction accuracies for protein (0.66), oil (0.73), and starch (0.7) content under low N stress. Parallel drought-stress trials conducted in Kenya and Zimbabwe using three F3 populations (753 families) revealed grain yield reductions of 31-59%. Under well-watered conditions, the four parental lines - CML543, CML444, LapostasequiaC7-F71, and CKL5009 - achieved grain yields of 6.97, 6.30, 6.31, and 5.87 t ha⁻¹, respectively, while under drought stress, yields declined to 2.32, 2.68, 5.08, and 3.69 t ha⁻¹. QTL analyses identified 93 and 41 QTLs associated with grain yield, anthesis-to-silking interval, plant height, and ear height under well-watered and drought-stressed conditions, respectively.
Eight major-effect QTLs (explaining >10% phenotypic variance) were detected under optimal conditions, compared to only two under drought stress. Joint linkage association mapping (JLAM) identified 25 QTLs for grain yield under well-watered conditions and four under water-limited conditions, with phenotypic variance explained (PVE) ranging from 0.80-3.9% (well-watered) and 1.4-1.8% (drought), primarily located on chromosomes 4 and 6. Five-fold cross-validation gave genomic prediction accuracies that ranged from low to high (r = -0.15 to 0.90), reflecting the polygenic nature of drought tolerance. In Kenya, one association panel, three doubled haploid (DH), and three F3 populations were evaluated for northern corn leaf blight (NCLB) resistance in disease hotspots. Disease severity scores on a 1.0-9.0 scale indicated high susceptibility in DH populations, with means of 5.17 (CML494×CML550) and 4.69 (CML511×CML550). In contrast, F3 populations CZL0723×CZL0719 (mean = 3.06) and CZL0009×CML505 (mean = 2.21) showed superior resistance. Across six populations, 23 QTLs conferring resistance were identified: three, six, and four in DH populations 1, 2, and 3, respectively, explaining 34.3%, 51.4%, and 41.1% of phenotypic variation; and two to four per F3 population, with individual QTLs explaining 2.8-15.8% of the variance. Through JLAM, 37 NCLB resistance QTLs were mapped across all 10 chromosomes, accounting for 49.4% of total phenotypic variation. GWAS using 337,110 high-quality SNPs identified 15 significant marker–trait associations. Several SNPs were located within genes containing functional domains related to stress response and developmental regulation. For example, SNP S2_213818302 was associated with peroxidase activity and oxidative stress tolerance, while S6_100083188 corresponded to a gene encoding phosphoglycerate kinase (PGK), a key enzyme in plant defence.
Genomic prediction models yielded moderate accuracies for NCLB resistance (r = 0.42-0.55), supporting the trait’s quantitative architecture shaped by a combination of major-effect loci and numerous minor-effect QTLs. To evaluate Striga hermonthica resistance, 328 maize testcrosses and six commercial hybrids were phenotyped under artificial infestation in Kenya. Broad-sense heritability ranged from 0.28 to 0.76. The donor line TZSTR167 was among the top performers, yielding 5.03 t ha⁻¹ with a Striga damage rating (SDR) score of 2.3 (on a 1 to 5 scale). GWAS using five multi-locus models (FASTmrMLM, FASTmrEMMA, pLARmEB, pKWmEB and ISIS EM-BLASSO) identified 81 quantitative trait nucleotides (QTNs) distributed across all 10 chromosomes. Key candidate genes included those encoding FAD-dependent oxidoreductase, RPM1-interacting protein 4 (RIN4), and Expansin-B4, associated with grain yield under infestation. Genomic prediction models achieved high accuracy for Striga counts at 10 weeks (r = 0.70) and SDR (r = 0.60), but lower accuracy for grain yield and silking date (r = 0.40). These results confirm the complex, polygenic nature of (a)biotic stress resistance in tropical maize. In pursuit of the second objective - evaluating the environmental performance of stress-tolerant genotypes - a machine learning-supported life cycle assessment (LCA) framework was applied to maize grown under low nitrogen and drought stress. In Kenya, average yield dropped by 47% under low nitrogen (3.78 t ha⁻¹) versus optimum conditions (7.14 t ha⁻¹). Protein and oil content decreased by 2.98% and 12.75%, respectively, while starch content increased by 0.86%. Genotypes such as CML505/LaPostaSeqC7-F64-2-6-2-2-B-B5017-B, CML505/LaPostaSeqC7-F64-2-6-2-2-B-B5023-B and CML505/LaPostaSeqC7-F64-2-6-2-2-B-B5350-B were identified as tolerant based on indices such as stress tolerance (TOL), stress susceptibility index (SSI), yield stability index (YSI), and percent yield reduction (PYR).
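The stress indices named above can be illustrated with a minimal Python sketch. It assumes the definitions most commonly used in the stress-tolerance literature (TOL as absolute yield loss, SSI as relative loss scaled by trial-wide stress intensity, YSI as the stress-to-optimum yield ratio, PYR as percentage loss); the thesis itself may use variants. The function name `stress_indices` and the example genotype yields are hypothetical, while the trial means of 7.14 and 3.78 t ha⁻¹ are the values reported for Kenya.

```python
from dataclasses import dataclass


@dataclass
class StressIndices:
    tol: float  # tolerance index: absolute yield loss (t/ha)
    ssi: float  # stress susceptibility index (unitless)
    ysi: float  # yield stability index (unitless)
    pyr: float  # percent yield reduction (%)


def stress_indices(yp, ys, mean_yp, mean_ys):
    """Compute four stress indices for one genotype (assumed definitions).

    yp, ys: genotype yield under optimum and stress conditions.
    mean_yp, mean_ys: trial mean yields under the same two conditions.
    """
    # Stress intensity of the trial: relative loss of the trial mean.
    stress_intensity = 1.0 - mean_ys / mean_yp
    return StressIndices(
        tol=yp - ys,
        ssi=(1.0 - ys / yp) / stress_intensity,
        ysi=ys / yp,
        pyr=100.0 * (yp - ys) / yp,
    )


# Trial means reported in the abstract for Kenya (t/ha).
MEAN_OPT, MEAN_LOW_N = 7.14, 3.78

# Applying the function to the trial means reproduces the overall reduction.
trial = stress_indices(MEAN_OPT, MEAN_LOW_N, MEAN_OPT, MEAN_LOW_N)

# Hypothetical genotype yields, invented for illustration only.
genotype = stress_indices(6.50, 4.20, MEAN_OPT, MEAN_LOW_N)

print(f"trial PYR = {trial.pyr:.1f}%")
print(f"genotype SSI = {genotype.ssi:.2f}, YSI = {genotype.ysi:.2f}")
```

Under these definitions an SSI below 1 marks a genotype that loses proportionally less yield than the trial average, which is one way a line could be flagged as tolerant.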
LCA results for these genotypes under low N stress showed average per-kilogram grain impacts of 0.39 kg CO2-eq (global warming), 8.03×10⁻⁵ kg P-eq (eutrophication), 0.005 kg SO2-eq (acidification), 0.0012 kg NOₓ-eq (oxidant formation), and 2.19 kg CFC-11-eq (ozone depletion). DH lines (CML550/CML511)-DH111 and DH26 showed the most environmentally efficient profiles, while others, including CML505/LaPostaSeqC7-F64-2-6-2-2-B-B5017-B and CML505/LaPostaSeqC7-F64-2-6-2-2-B-B5350-B, exhibited higher global warming potential (GWP100) and nutrient loss potentials. XGBoost classifiers differentiated high-yield/high-impact genotypes from low-impact, stress-resilient ones, achieving a mean accuracy of 0.84 (±0.10). SHAP (SHapley Additive exPlanations), LIME (Local Interpretable Model-agnostic Explanations) and principal component analysis (PCA) identified three clusters of genotypes: (1) high-yield, high-impact; (2) moderate-yield, nitrogen-efficient; and (3) low-yield, low-impact stress-resilient. An expanded evaluation across Kenya, Mexico, and Thailand included 1,053 maize genotypes under drought and optimum conditions. Yield reductions ranged from 58% ((CML543/LapostaSequiaF71)F3_Pop_2_241) to 64.6% ((CML543/CML444)F3Pop_1_180) in Kenya, 87% (CML-326-B-B/CML-312 SR) to 92.6% (CML311/MBR C3 Bc F95-2-2-1-B-B-B-B-B-B-B/CML-312 SR) in Mexico, and 69.8% (CLA149-B-B/CML-312 SR) to 94.7% (DTPWC9-F67-2-2-1-B-B-B-B-B/CML-312 SR) in Thailand. In Kenya, GWP100 ranged from 0.26 to 0.38 kg CO2-eq per kg grain under optimal conditions (mean = 0.31 kg CO2-eq) and increased to 0.43-0.84 kg CO2-eq (mean = 0.58 kg CO2-eq) under drought. Mexico recorded GWP100 of 0.26-0.62 kg CO2-eq (mean = 0.36 kg CO2-eq) under optimal conditions, increasing markedly under drought to 0.30-1.39 kg CO2-eq (mean = 1.08 kg CO2-eq).
Thailand recorded the lowest GWP100 values under optimal conditions (0.32 kg CO2-eq; range: 0.20-0.51), but drought stress drove GWP100 values significantly higher (0.52-1.29 kg CO2-eq; mean = 0.8 kg CO2-eq). The lowest GWP100 values under optimum conditions were recorded for La Posta Seq C7-F78-2-1-1-1-B-B-B-B/CML-312 SR (0.2 kg CO2-eq in Thailand), (CML543/LapostaSequiaF71)F3Pop 2_22 (0.26 kg CO2-eq in Kenya) and CIMCALI8843/S9243-BB-#-B-5-1-BB-4-1-3/CML-312 SR (0.26 kg CO2-eq in Mexico). Under drought, CLQ-RCWQ39=(CML159*CML144)-B-27-1-2-B*3-B-B/CML-312 SR (Mexico) recorded the lowest GWP100 score: 0.3 kg CO2-eq. To model the relationship between stress tolerance and environmental impacts, linear regression, random forest, and XGBoost algorithms were applied across eight ReCiPe 2016 (H) midpoint LCA categories. For drought-induced GWP100, linear regression performed best (R² = 0.87, RMSE = 0.11, MAE = 0.065), outperforming random forest (R² = 0.68) and XGBoost (R² = 0.72). SHAP analyses highlighted strong associations between stress indices - particularly relative stress index (RSI), geometric mean productivity (GMP), mean relative performance (MRP), and stress susceptibility index (SSI) - and GWP100. Yield index (YI), mean productivity (MP), and MRP consistently showed negative correlations with environmental impacts, supporting their utility for sustainability-focused genotype selection. To achieve the third objective - testing the cross-crop scalability of a genomic-LCA-machine learning (ML) framework - methodologies were extended to forage- and legume-based agri-food systems in temperate and tropical environments. In Ireland, a perennial ryegrass-based pasture dairy model was developed using the Farm Level Module of the GOBLIN (General Overview for a Back-casting approach of Livestock Intensification) model.
Baseline emissions per kg fat-and-protein-corrected milk (FPCM) were 1.08 kg CO2-eq (GWP100), 0.0066 kg PO4-eq (eutrophication), and 0.013 kg SO2-eq (acidification). XGBoost regressors trained on 10,000 simulated scenarios achieved high predictive accuracy (R² = 0.99) and identified dry matter digestibility, crude protein, and nitrogen fertiliser input as primary emission drivers. Optimizing these traits revealed an ‘LCA-designed’ ideotype capable of reducing GWP100 by 36.7%, acidification by 31%, and eutrophication by 29% without compromising system productivity. These results highlight the critical but often overlooked contribution of forage ideotypes to environmental performance in livestock systems, reinforcing the need for integrated crop-livestock breeding strategies. In Zambia, yield-scaled life cycle assessments of 70 soybean genotypes across three agro-ecological zones revealed significant variability in both yield (μ = 2.3 t/ha, p<2×10⁻¹⁶) and environmental impacts. Mean GWP100 was 921.6 kg CO2-eq per tonne of grain, with top-performing genotypes achieving values as low as 654 kg CO2-eq. Genotypes exhibiting moderate yield stability (CV = 0.37-0.51) showed optimal trade-offs across global warming, eutrophication (0.24 kg P-eq), acidification (1.98 kg SO2-eq), and particulate matter formation (1.16 kg PM2.5-eq) indicators. XGBoost consistently outperformed other ML algorithms across environmental categories, confirming its utility in genotype-environment-impact prediction pipelines. Principal component and correlation analyses further revealed key trade-offs between stability and footprint, underscoring the complexity of multi-objective selection. Together, these results demonstrate the flexibility and robustness of the genomic-LCA-ML framework for guiding the development of climate-smart ideotypes across distinct crop systems and production ecologies.
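The yield-scaled footprint and the yield-stability coefficient of variation (CV) referred to above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the thesis workflow: the per-hectare emission figure, the genotype names, and the zone yields are all hypothetical, and CV is taken here as the sample standard deviation divided by the mean.

```python
from statistics import mean, stdev


def yield_scaled_gwp(area_emissions_kg_co2e_per_ha, yield_t_per_ha):
    """Return GWP100 in kg CO2-eq per tonne of grain."""
    if yield_t_per_ha <= 0:
        raise ValueError("yield must be positive")
    return area_emissions_kg_co2e_per_ha / yield_t_per_ha


def coefficient_of_variation(yields):
    """Sample standard deviation divided by the mean (unitless)."""
    return stdev(yields) / mean(yields)


# Hypothetical per-hectare emissions, assumed equal for both genotypes.
AREA_EMISSIONS = 2100.0  # kg CO2-eq per ha

# Hypothetical yields (t/ha) in three agro-ecological zones.
genotypes = {
    "G_stable": [2.4, 2.2, 2.3],
    "G_variable": [3.6, 1.2, 2.1],
}

for name, zone_yields in genotypes.items():
    gwp = yield_scaled_gwp(AREA_EMISSIONS, mean(zone_yields))
    cv = coefficient_of_variation(zone_yields)
    print(f"{name}: GWP100 = {gwp:.0f} kg CO2-eq/t, CV = {cv:.2f}")
```

Both hypothetical genotypes share the same mean yield and therefore the same yield-scaled footprint, yet their CVs differ widely, which illustrates why stability and footprint have to be considered jointly in multi-objective selection.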
Publisher
University of Galway
Rights
CC BY-NC-ND