Podcast Episode on LLMs for Software Engineering


In this (German-language) episode of Code for Thought, Carina Haupt (DLR) and I talk with host Peter Schmid about large language models in Research Software Engineering: where tools like Copilot genuinely help (routine tasks, tests, small scripts), and where real-world projects expose their limits (builds, requirements, missing project context). I also share my current work on matching papers to repositories via embeddings at the project level rather than at the level of individual functions, so that models carry meaningful context across an entire codebase.
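
To make the idea of project-level matching concrete, here is a minimal sketch, assuming an off-the-shelf sentence-transformers encoder and simple mean-pooling over a repository's documents; the model name, aggregation strategy, and function names are illustrative assumptions, not the pipeline discussed in the episode.

```python
# Illustrative sketch: match a paper to candidate repositories via
# project-level embeddings (model and aggregation are assumptions).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed off-the-shelf encoder


def embed_repository(doc_texts: list[str]) -> np.ndarray:
    """Embed a whole repository by averaging the embeddings of its documents
    (README, docstrings, selected source files) instead of single functions."""
    vectors = model.encode(doc_texts, normalize_embeddings=True)
    repo_vector = vectors.mean(axis=0)
    return repo_vector / np.linalg.norm(repo_vector)


def rank_repositories(paper_abstract: str,
                      repos: dict[str, list[str]]) -> list[tuple[str, float]]:
    """Rank candidate repositories by cosine similarity to the paper abstract."""
    paper_vec = model.encode([paper_abstract], normalize_embeddings=True)[0]
    scores = {name: float(np.dot(paper_vec, embed_repository(texts)))
              for name, texts in repos.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

The point of pooling over the whole repository is that the resulting vector reflects the project as a unit, which is what a paper actually describes, rather than any individual function.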

We also dig into evaluation: benchmarks rarely capture real developer workflows, ground truths are often shaky, and results are non-deterministic, so runs must be repeated and measured properly. Another under-appreciated shift: faster code generation often means more reviewing and debugging, which changes how developers spend their time and attention. Finally, we touch on resource costs and why the value of these tools should be weighed against their compute and energy use.
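
As a small illustration of what "repeat runs and measure properly" means in practice, here is a sketch under the assumption of a hypothetical `generate_and_test` callback that performs one generation and reports whether the output passes the task's checks; it is not tied to any specific benchmark from the episode.

```python
# Sketch of repeated-run evaluation for a non-deterministic LLM task.
# `generate_and_test` is a hypothetical callback: one generation, True if it passes.
import statistics
from typing import Callable


def evaluate_repeatedly(generate_and_test: Callable[[], bool], runs: int = 10) -> dict:
    """Run the task several times and report the pass rate with its spread,
    rather than trusting a single lucky (or unlucky) run."""
    outcomes = [generate_and_test() for _ in range(runs)]
    pass_values = [1.0 if ok else 0.0 for ok in outcomes]
    return {
        "runs": runs,
        "pass_rate": sum(pass_values) / runs,
        "stdev": statistics.pstdev(pass_values),
    }
```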

Bottom line: the first hype wave has ebbed, but the practical value remains, provided tasks are scoped well, context is supplied, and outputs are verified. That's exactly where our research is aimed: making LLMs genuinely useful for research code rather than merely impressive in demos.

The episode is available on the Code for Thought website as well as on platforms such as Spotify and YouTube.