A systematic evaluation of leading large language models and an investigation of their factuality as question-answering systems