High-fidelity talking avatar video generation