Facebook Transcoder
  • Last Updated : 02 Dec, 2020

TransCoder was proposed by researchers at Facebook in September 2020 in the paper titled "Unsupervised Translation of Programming Languages". The goal of the project was to train an AI model to understand code in one programming language and convert it to another. Many companies have codebases in old programming languages like COBOL, and migrating such a codebase to a newer language like Java, C++, or Python requires a lot of money and effort. In such cases, TransCoder can save a lot of resources.

Model:

For TransCoder, the authors use a sequence-to-sequence (seq2seq) model with attention. The seq2seq architecture consists of an encoder and a decoder, both built on the transformer architecture. The authors use a single shared model for all programming languages. To train the model, they propose three principles:

  • Cross-Programming-Language Model (XLM) Pre-training: Pre-training is important because it ensures that different pieces of code expressing the same instructions are mapped to the same representation, regardless of the programming language. The authors observed that pre-training with a masked language modeling objective on monolingual source code leads to significant improvements.
  • Denoising Auto-Encoding: XLM pre-training generates high-quality embeddings for the encoder, but the decoder still lacks training to translate code. Therefore, the model is also trained with a denoising auto-encoding objective, where it learns to predict a sequence of tokens given a corrupted version of that sequence.
  • Back-Translation: In practice, XLM pre-training and denoising auto-encoding alone are enough to generate translations. However, the quality of these translations tends to be low, as the model is never trained to translate functions from one language to another. To address this, the authors use back-translation, a method commonly used when only large monolingual (single-language) datasets are available. Here, the model is trained to translate from source to target and from target to source in parallel.
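The corruption step behind the denoising objective can be sketched with a toy function. The masking and dropping operations below follow the general recipe of denoising auto-encoding; the specific rates and the `<MASK>` symbol are illustrative choices, not the paper's exact hyperparameters:

```python
import random

def corrupt(tokens, mask_rate=0.15, drop_rate=0.1, seed=0):
    """Corrupt a token sequence for denoising auto-encoding:
    randomly mask some tokens and drop others. The model is then
    trained to reconstruct the original, uncorrupted sequence."""
    rng = random.Random(seed)
    corrupted = []
    for tok in tokens:
        r = rng.random()
        if r < drop_rate:
            continue                      # drop the token entirely
        elif r < drop_rate + mask_rate:
            corrupted.append("<MASK>")    # hide the token behind a mask
        else:
            corrupted.append(tok)         # keep the token unchanged
    return corrupted

tokens = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
print(corrupt(tokens))
```

Training pairs are then (corrupted sequence, original sequence), so the decoder learns to emit syntactically valid code even from noisy encoder inputs.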

Datasets and Training details:



  • Training Architecture: The authors use a transformer with 6 layers, 8 attention heads, and a model dimensionality of 1024. They use a single encoder and a single decoder for all programming languages. After pre-training, they train the model by alternating between the denoising auto-encoding and back-translation objectives. For optimization, they use the Adam optimizer.
  • Preprocessing: The authors use the javalang tokenizer for Java, the tokenizer of the standard library for Python, and the clang tokenizer for C++. These tokenizers ensure that meaningless modifications in the code (such as extra whitespace) do not affect the tokenized sequence.
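For Python, the standard library's tokenize module provides exactly this kind of layout-insensitive tokenization. The snippet below is an illustration of that property, not TransCoder's actual preprocessing code:

```python
import io
import tokenize

def tokens_of(src):
    """Return the token strings of a Python source snippet,
    ignoring layout-only tokens (newlines, indentation markers)."""
    toks = tokenize.generate_tokens(io.StringIO(src).readline)
    return [t.string for t in toks
            if t.type not in (tokenize.NEWLINE, tokenize.NL,
                              tokenize.INDENT, tokenize.DEDENT,
                              tokenize.ENDMARKER)]

# Extra spaces do not change the token sequence:
print(tokens_of("x = a + b"))
print(tokens_of("x=a+b") == tokens_of("x   =  a +   b"))
```

Because the model only ever sees the token sequence, two versions of a function that differ only in formatting map to identical training inputs.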

t-SNE visualization of cross-lingual token embeddings: embeddings of similar tokens in different languages lie close to each other.

  • Datasets Used: For training, the authors use the public GitHub dataset available on Google BigQuery, filter for repositories whose licenses permit redistribution of parts of projects, and select Java, Python, and C++ code. They train and evaluate at the function level, since functions are short enough to fit in a single batch and can be evaluated easily through their outputs. For validation and testing, the authors use solutions to the same problems in multiple languages from the GeeksforGeeks platform.

Implementation:

Python3


# First clone the git repository for TransCoder
# in the local environment
!git clone https://github.com/facebookresearch/TransCoder transcoder/

# Download the model file (link given in the official
# git repository)
!wget https://dl.fbaipublicfiles.com/transcoder/model_1.pth

# Since TransCoder is implemented in PyTorch,
# we need to install PyTorch first
!pip install torch torchvision

# Now install the other required dependencies
!pip install numpy fastBPE Moses Apex libclang submitit six sacrebleu==1.2.11

# Now we run the translate.py file with the following arguments:
# --src_lang = source language
# --tgt_lang = target language
# --model_path = path of the model which we downloaded above
# < file.java/cpp/py = file which we want to convert
# the command below may take some time to run
!python transcoder/translate.py --src_lang java --tgt_lang python --model_path model_1.pth < code.java


Below are example inputs and the corresponding TransCoder outputs.

C++

void worstFit(int blockSize[], int m, int processSize[], int n)
{
    int allocation[n];
    memset(allocation, -1, sizeof(allocation));
    for (int i = 0; i < n; i++) {
        int wstIdx = -1;
        for (int j = 0; j < m; j++) {
            if (blockSize[j] >= processSize[i]) {
                if (wstIdx == -1)
                    wstIdx = j;
                else if (blockSize[wstIdx] < blockSize[j])
                    wstIdx = j;
            }
        }
        if (wstIdx != -1) {
            allocation[i] = wstIdx;
            blockSize[wstIdx] -= processSize[i];
        }
    }
    cout << "\nProcess No.\tProcess Size\tBlock no.\n";
    for (int i = 0; i < n; i++) {
        cout << " " << i + 1 << "\t\t" << processSize[i] << "\t\t";
        if (allocation[i] != -1)
            cout << allocation[i] + 1;
        else
            cout << "Not Allocated";
        cout << endl;
    }
}

Python

def worstFit(blockSize, m, processSize, n):
    allocation = [-1] * n
    for i in range(n):
        wstIdx = -1
        for j in range(m):
            if blockSize[j] >= processSize[i]:
                if wstIdx == -1:
                    wstIdx = j
                elif blockSize[wstIdx] < blockSize[j]:
                    wstIdx = j
        if wstIdx != -1:
            allocation[i] = wstIdx
            blockSize[wstIdx] -= processSize[i]
        print("\nProcess No.\tProcess Size\tBlock no.\n")
        for i in range(n):
            print(" " + str(i + 1) + "\t\t" +
                  str(processSize[i]) + "\t\t")
            if allocation[i] != -1:
                print(allocation[i] + 1)
            else:
                print("Not Allocated")
            print()

Java

public static int max(int a, int b) {
    return a > b ? a : b;
}

public static void createDirectory(Path path) throws IOException
{
    if (!Files.exists(path)) {
        Files.createDirectories(path);
    }
}

Python

def max(a, b):
    return a if a > b else b

def create_directory(path):
    if not os.path.exists(path):
        os.makedirs(path)

Results and Evaluation Methods:

  • Evaluation Metrics: For most of the study, the authors used the BLEU score. Another metric they used, reference match, is simply the percentage of translations that perfectly match the ground-truth reference. However, these metrics do not ensure the correctness of the program, so the authors introduced a third metric, computational accuracy, which checks whether the hypothesis function produces the same outputs as the source function when given the same inputs.
  • Beam Search: The authors experimented with two decoding strategies: returning only the hypothesis with the highest log-probability (greedy search), or keeping the top-N hypotheses by log-probability (beam search). They found a significant improvement in accuracy with beam search, up to 33.7% for Java → C++ with a beam of N = 25.
  • Below are the results when they used greedy search:

                            C++→Java  C++→Python  Java→Python  Java→C++  Python→C++  Python→Java
    Reference Match              3.1         6.7         24.7       3.7         4.9          0.8
    BLEU Score                  85.4        70.1         97.0      68.1        65.4         64.6
    Computational Accuracy      60.9        44.5         80.9      35.0        32.2         24.7
  • Below are the evaluation results based on beam search with N = 25:

                            C++→Java  C++→Python  Java→Python  Java→C++  Python→C++  Python→Java
    Computational Accuracy      74.8        67.2         91.6      68.7        57.3         56.1
  • The authors compared TransCoder with two existing frameworks: j2py (a Java-to-Python converter) with 38.3% computational accuracy, and Tangible Software's C++-to-Java converter with 61% computational accuracy. TransCoder clearly outperforms both.
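The gap between greedy decoding and beam search can be sketched on a toy model where the next-token log-probabilities are known in advance. The tiny "model" below is invented purely for illustration; it stands in for the decoder's softmax output:

```python
import math

# Toy next-token log-probabilities given the sequence so far
# (a hand-crafted stand-in for the decoder's output distribution).
LOGPROBS = {
    (): {"a": math.log(0.6), "b": math.log(0.4)},
    ("a",): {"x": math.log(0.6), "<eos>": math.log(0.4)},
    ("b",): {"<eos>": math.log(0.95), "x": math.log(0.05)},
    ("a", "x"): {"<eos>": 0.0},
    ("b", "x"): {"<eos>": 0.0},
}

def greedy(max_len=3):
    """Always extend with the locally most probable token."""
    seq = ()
    while len(seq) < max_len:
        tok = max(LOGPROBS[seq], key=LOGPROBS[seq].get)
        if tok == "<eos>":
            break
        seq = seq + (tok,)
    return seq

def beam_search(n=2, max_len=3):
    """Keep the top-n partial hypotheses by total log-probability,
    then return the best finished hypothesis."""
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in LOGPROBS[seq].items():
                if tok == "<eos>":
                    finished.append((seq, score + lp))
                else:
                    candidates.append((seq + (tok,), score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:n]
    return max(finished, key=lambda c: c[1])[0]

print(greedy())       # follows locally best tokens: ('a', 'x')
print(beam_search())  # finds the higher-probability sequence: ('b',)
```

Greedy search commits to "a" (probability 0.6) and ends up with a full sequence of probability 0.6 × 0.6 = 0.36, while beam search keeps the "b" hypothesis alive and recovers the better sequence with probability 0.4 × 0.95 = 0.38; this is the same effect that lifts TransCoder's computational accuracy when moving from greedy decoding to a beam of N = 25.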

References:

  • "Unsupervised Translation of Programming Languages", Lachaux et al., 2020 (arXiv:2006.03511).
