Translate with ChatGPT

ChatGPT is a chatbot created by OpenAI. It is based on InstructGPT: it has been trained to follow and respond to instructions, or “prompts,” supplied by users.

ChatGPT exhibits an excellent ability to generate thorough and clear responses to user queries. It appears to excel at natural language processing (NLP) tasks such as summarization, question answering, language generation, and machine translation.


Nevertheless, because it is a new system, ChatGPT has yet to be fully assessed scientifically in order to compare its NLP performance to previous work.

In this direction, Tencent AI presented a preliminary investigation of ChatGPT’s translation capabilities:

Is ChatGPT a Good Translator? A Preliminary Study (Tencent AI)


The primary goal of this research is to assess ChatGPT for text translation into English, as the majority of its training data is in English. As noted above, ChatGPT is built on instructions: InstructGPT is a GPT-3 that has been fine-tuned with prompts that are “mostly in English” (Ouyang et al., 2022). Moreover, 93% of the pre-training data for GPT-3 is in English (Brown et al., 2020).

They also assess translation into languages that are significantly underrepresented in its training data, such as Japanese and Romanian, and which are hence more difficult to translate.

In this article, I will evaluate and explain their primary findings, focusing on what appears to work and what does not when using ChatGPT as a machine translation system.

The Decline Of Statistical Significance Testing

Dror et al. (2018) conducted one of the most recent and widely recognized studies on the use of statistical significance testing in NLP. One of their main findings is that NLP researchers rely heavily on empirical results and draw conclusions from them without first examining their statistical significance. After reviewing 213 articles from ACL and TACL 2017, they discovered that only 36.6% (78 publications) claimed to have performed statistical significance testing, implying that 63.4% did not check whether their observations could be coincidental.


I extended this study to the last 12 years of machine translation research (my area of expertise) published at ACL conferences. I carefully annotated 913 papers in total. The authors of all the annotated publications compared machine translation systems using an automated evaluation metric such as BLEU.
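The kind of test these papers omit can be sketched in a few lines. A standard choice in machine translation is paired bootstrap resampling (Koehn, 2004): resample the test set with replacement many times and count how often system A beats system B. The function and the toy sentence-level scores below are my own illustration, not from the study:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Paired bootstrap resampling: estimate the probability that system A
    beats system B when sentence-level scores are resampled with replacement."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Draw a bootstrap sample of sentence indices (same indices for A and B).
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_resamples  # close to 1.0 means A is reliably better

# Toy sentence-level scores for two hypothetical MT systems.
a = [0.42, 0.55, 0.38, 0.61, 0.47, 0.52, 0.44, 0.58]
b = [0.40, 0.50, 0.39, 0.57, 0.45, 0.51, 0.43, 0.55]
print(paired_bootstrap(a, b))
```

If the returned proportion is above, say, 0.95, the difference between the two systems is unlikely to be a coincidence of the particular test set.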

Translation Prompt

One of the most crucial aspects of dealing with generative language models is prompt design.

Given our desired task, we must find a suitable natural-language formulation to query the model. In this case, we want ChatGPT to translate a sentence from a source language (denoted “[SRC]”) into a target language (denoted “[TGT]”).

Tencent AI took the unusual approach of explicitly asking ChatGPT itself to propose ten translation prompts.

ChatGPT returned ten prompts with only minor differences between them. The authors eventually decided to test only the three prompts listed below, which are the most representative of the ten prompts ChatGPT supplied:

Prompt 1: Convert the following phrases from [SRC] to [TGT]:
Prompt 2: Respond without using quotation marks. What do these phrases in [TGT] mean?
Prompt 3: Provide the [TGT] translation for the following sentences:
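Filling such templates is straightforward string substitution. The following sketch (all function and variable names are mine) shows how a prompt would be assembled before being sent to the model:

```python
# The three prompt templates, with [SRC]/[TGT] placeholders as in the study.
TEMPLATES = {
    1: "Convert the following phrases from [SRC] to [TGT]:",
    2: "Respond without using quotation marks. What do these phrases in [TGT] mean?",
    3: "Provide the [TGT] translation for the following sentences:",
}

def build_prompt(template_id: int, src: str, tgt: str, text: str) -> str:
    """Fill the [SRC]/[TGT] placeholders and append the text to translate."""
    template = TEMPLATES[template_id]
    filled = template.replace("[SRC]", src).replace("[TGT]", tgt)
    return f"{filled}\n{text}"

print(build_prompt(3, "Chinese", "English", "你好，世界。"))
```

Note that only Prompt 1 actually consumes the [SRC] placeholder; Prompts 2 and 3 never mention the source language, which matters for the results discussed below.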


They tested each of these prompts on a Chinese-to-English translation task ([SRC]=Chinese, [TGT]=English), and the results were as follows:

BLEU, chrF++, and TER are three automated metrics for assessing machine translation quality. Higher scores are better for BLEU and chrF++; lower scores are better for TER.
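To give an intuition for what these metrics compute, here is a toy, simplified TER: the word-level edit distance between hypothesis and reference, divided by the reference length. Real TER also counts block shifts and is normalized differently; this is only an illustration, and the function name is mine:

```python
def simple_ter(hypothesis: str, reference: str) -> float:
    """Toy TER: word-level edit distance divided by reference length.
    (The real TER metric additionally allows cheap block shifts.)"""
    hyp, ref = hypothesis.split(), reference.split()
    # Classic dynamic-programming edit distance, computed over words.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        curr = [i]
        for j, r in enumerate(ref, 1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / max(len(ref), 1)

print(simple_ter("the cat sat on the mat", "the cat sat on the mat"))  # 0.0
print(simple_ter("a cat sat on mat", "the cat sat on the mat"))
```

A perfect translation scores 0.0; the second call needs one substitution and one insertion against a six-word reference. For real evaluations, the sacrebleu library provides reference implementations of BLEU, chrF++, and TER.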

Based on these three metrics, they found that Prompt 3 performs best. Prompt 2 also appears to be superior to Prompt 1, even though their chrF++ scores are comparable.


This is noteworthy since Prompt 1 specifies the source language whereas the other two do not, yet Prompt 1 underperforms. ChatGPT does not need to be told the language of the content to be translated.

This is both surprising and counter-intuitive. We might have expected ChatGPT to be more accurate when the source language is stated explicitly in the prompt; knowing the source language is essential for human translators.

There is currently no solid explanation for why ChatGPT scores lower when the source language is specified. We may presume that ChatGPT can infer the source language from the user input. If so, specifying it should have no effect, rather than the negative impact observed in Tencent AI’s findings.

General translation

Now that we have established a good prompt, we can compare ChatGPT against state-of-the-art machine translation systems.

Tencent AI picked Google Translate, DeepL, and their own online system, Tencent TranSmart, for comparison.

The outcomes are as follows:

The three online systems perform similarly and appear to outperform ChatGPT, although the authors do not report statistical significance testing to confirm that the differences are indeed significant.

Still, I found these results outstanding. Because it is built on InstructGPT, we may presume that ChatGPT was trained mostly on English data, yet it appears capable of capturing the meaning of Chinese sentences well enough to produce English translations.

If we could fine-tune ChatGPT for Chinese-to-English translation, we would likely achieve far higher translation quality.

Tencent AI also reports comparable results for all translation directions between English, Chinese, German, and Romanian in the paper.


Once again, the performance (in BLEU) is impressive: ChatGPT can produce translations even for directions that do not involve English, such as German-to-Chinese. According to BLEU, the online systems remain better, which is expected given that they were trained specifically for these translation directions, while ChatGPT was not.

The results for Romanian are significantly different: ChatGPT’s BLEU score is almost 50% lower than that of the online systems. This disparity is most likely statistically significant.

The authors propose an explanation: Romanian has far fewer resources available, such as Romanian text on the Internet, than German or Chinese. ChatGPT may have encountered too few Romanian sentences during training to model the language correctly.

Domain and Robustness

They conducted further experiments to assess ChatGPT’s ability to translate texts from a specific domain (biomedical) and user-generated texts (posted on social media, which are usually very noisy and full of grammatical errors).

Limitations of this study

The authors admit in their paper that further experiments with more language pairs are needed to adequately assess the translation quality of ChatGPT.


The lack of human evaluation is the main limitation of this work.

The influence of the prompt, in my view, could also be examined further. The authors took an unusual approach by letting ChatGPT propose the prompts itself. Yet asking ChatGPT to propose prompts is a chicken-and-egg situation: the prompt used to obtain machine translation prompts may have had a significant influence on all of the subsequent experiments in this study. Prior work on prompt design for machine translation experimented with a wide range of handcrafted prompts.


ChatGPT is good at machine translation

We can already conclude from this preliminary study that ChatGPT is good, and perhaps even better than traditional online systems, at translating text that resembles ChatGPT’s training data, such as noisy user-generated texts in English.

Nevertheless, as expected, ChatGPT lags behind more traditional machine translation systems when translating into languages other than English, particularly distant or low-resource languages such as Japanese or Romanian.






