Hi Carlos,
I agree with you that LSTMs are stochastic, so the diagnostic plot will normally look different on each run. Ideally, in a real use case, I would always repeat the diagnostic run multiple times to get a robust picture of the model's behavior over time.
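As a rough sketch of what repeating the diagnostic run could look like: the snippet below collects one loss curve per run and summarizes them with a per-epoch mean and standard deviation. Note that `simulated_run` is a hypothetical stand-in I made up so the example runs on its own; in practice you would replace it with a function that trains your LSTM once and returns its validation-loss history. The number of runs and epochs are also just assumed values.

```python
import numpy as np

def simulated_run(seed, epochs=20):
    """Hypothetical stand-in for one training run.

    Returns a noisy, decaying loss curve so the example is runnable
    without a deep learning library. Swap in real training here.
    """
    rng = np.random.default_rng(seed)
    base = np.exp(-np.arange(epochs) / 5.0)      # idealized decaying loss
    noise = rng.normal(0.0, 0.05, size=epochs)   # run-to-run variation
    return base + noise

def summarize_runs(run_fn, n_runs=10, epochs=20):
    """Repeat the diagnostic run and return per-epoch mean and std."""
    curves = np.stack([run_fn(seed, epochs) for seed in range(n_runs)])
    return curves.mean(axis=0), curves.std(axis=0)

mean_curve, std_curve = summarize_runs(simulated_run)
```

Plotting `mean_curve` with a band of plus or minus `std_curve` gives a more stable view than any single diagnostic plot.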
I would also recommend reporting other measures, e.g. root mean squared error (RMSE), symmetric mean absolute percentage error (sMAPE), coefficient of determination (R2), mean absolute error (MAE) and median absolute error (MedAE), to measure the forecast fit.
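The measures listed above can be computed with plain NumPy, roughly as in the sketch below. The function name `forecast_metrics` is my own, and the sMAPE variant used here (mean of |error| divided by the average of |actual| and |forecast|, times 100, skipping points where both are zero) is an assumption, since several definitions are in use.

```python
import numpy as np

def forecast_metrics(y_true, y_pred):
    """Return common forecast-fit measures as a dict (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    abs_err = np.abs(err)

    # sMAPE: assumed variant, in percent, ignoring points where both values are 0
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    mask = denom > 0
    smape = 100.0 * np.mean(abs_err[mask] / denom[mask]) if mask.any() else 0.0

    # R2: 1 - residual sum of squares / total sum of squares
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "smape": float(smape),
        "r2": float(r2),
        "mae": float(np.mean(abs_err)),
        "medae": float(np.median(abs_err)),
    }
```

Calling `forecast_metrics(actual, forecast)` on the held-out part of the series for each repeated run gives one row of scores per run, which can then be averaged.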
I found this article (https://repository.upenn.edu/cgi/viewcontent.cgi?article=1166&context=wharton_research_scholars) quite interesting for its comparison between a traditional ARIMA and an LSTM algorithm. Although the paper is interesting, its conclusion seems to have been drawn from a lab experiment on a simple time series.
I will connect with you on LinkedIn to share the link to the code.
Thanks, Sarit