About Me

I graduated from the University of Texas at Dallas with an MSc in Computer Science, with a concentration in Data Science. I work as a Data Engineer and am a CCA175 Cloudera-certified Spark and Hadoop Developer. I also have broad experience in machine learning and deep learning, especially Natural Language Processing and Convolutional Neural Networks.

My favorite hobby is photography. Check my Flickr page for photos.

Skills

  • Programming Languages: Python, Java, Scala, JavaScript
  • Data Engineering: HDFS, Hadoop, Spark, Hive, Sqoop, Flume
  • Machine Learning: Scikit-learn, Keras, TensorFlow, PySpark, Deeplearning4j
  • Databases: MySQL, MongoDB, HBase
  • Cloud Services: GCP, AWS
  • OS: Linux, Windows

Social Media

  • LinkedIn:
  • GitHub:
  • Flickr:

Projects & Research

Agile User Story Point Estimation

Effort estimation is an important part of software project management: cost and schedule overruns create risk for software projects, and accurate effort estimates help project managers and clients ensure that projects finish on time. In modern Agile development, software is developed through repeated cycles (iterations). A project consists of a number of iterations; each iteration is a short period (usually 2–4 weeks) in which the development team designs, implements, tests, and delivers a distinct product increment, e.g. a working milestone version or a working release. Each iteration requires the completion of a number of user stories, which are a common way for agile teams to express user requirements.
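A text-based baseline for this task can be sketched as follows. This is only an illustrative example, assuming a TF-IDF + ridge regression pipeline over user story text; the stories and their point labels below are hypothetical toy data, not the project's actual model or dataset.

```python
# Minimal sketch: estimate story points from user story text.
# TF-IDF features + ridge regression (hypothetical toy data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

stories = [
    "As a user I can log in with my email",
    "As an admin I can export monthly billing reports",
    "As a user I can reset my password",
    "As an admin I can configure role-based access control",
]
points = [2, 8, 3, 13]  # hypothetical story point labels

model = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))
model.fit(stories, points)

# Estimate points for an unseen story.
estimate = model.predict(["As a user I can update my profile"])[0]
print(round(float(estimate), 1))
```

In practice a model like this would be trained on thousands of labeled stories and evaluated with mean absolute error against the team's actual estimates.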

Code Presentation

Toxic Comment Classification

The aim of this project is to categorize toxic comments by type of toxicity: toxic, severely toxic, obscene, threat, insult, and identity hate. Several machine learning techniques, including Logistic Regression, Support Vector Machines, and Naive Bayes, are implemented to detect the six toxicity types. Data: the dataset is taken from a Kaggle competition and contains a large number of comments from Wikipedia talk page edits, labeled by human raters for toxic behavior.
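Because each comment can carry several toxicity types at once, this is a multi-label problem. A minimal sketch of that setup, assuming a TF-IDF + one-vs-rest Logistic Regression baseline (the comments and multi-hot labels below are illustrative toy data, not the Kaggle training set):

```python
# Minimal multi-label sketch: one binary classifier per toxicity type.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
comments = [
    "you are an idiot",
    "thanks for the helpful edit",
    "i will find you and hurt you",
    "great article, well sourced",
]
# Toy multi-hot targets: one column per toxicity type.
y = np.array([
    [1, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
])

clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
clf.fit(comments, y)

# Predict a 0/1 flag for each of the six toxicity types.
pred = clf.predict(["what a wonderful contribution"])
print(dict(zip(labels, pred[0])))
```

The same pipeline shape works for the SVM and Naive Bayes variants by swapping the inner estimator.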

Code

Complementing Global and Local Contexts in Representing API Descriptions to Improve API Retrieval Tasks

We present D2Vec, a neural network model that considers two complementary contexts to better capture the semantics of API documentation. First, we connect the global context of the current API topic under description to all the text phrases within the description of that API. Second, the local orders of words and API elements in the text phrases are maintained in computing the vector representations for the APIs. We conducted an experiment to verify two intrinsic properties of D2Vec's vectors: 1) similar words and relevant API elements are projected into nearby locations; and 2) some vector operations carry semantics. We demonstrate the usefulness and good performance of D2Vec in three applications: API code search (text-to-code retrieval), API tutorial fragment search (code-to-text retrieval), and mining API mappings between software libraries (code-to-code retrieval). Finally, we provide actionable insights and implications for researchers in using our model in other applications with other types of documents.
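The first intrinsic property above (relevant items land near each other, enabling retrieval by vector similarity) can be sketched in miniature. This example is only an assumption-laden stand-in: it uses TF-IDF vectors in place of D2Vec's learned representations, and the API names and descriptions are hypothetical.

```python
# Minimal sketch of text-to-code retrieval by cosine similarity
# (TF-IDF vectors stand in for learned D2Vec embeddings).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

api_docs = {
    "java.io.FileReader": "reads character files using a default buffer",
    "java.net.URL": "represents a uniform resource locator pointer",
    "java.util.HashMap": "hash table based implementation of the Map interface",
}

vec = TfidfVectorizer().fit(api_docs.values())
doc_matrix = vec.transform(api_docs.values())

def search(query):
    # Rank APIs by cosine similarity between the query vector and each
    # description vector (TF-IDF rows are L2-normalized, so the dot
    # product equals cosine similarity).
    q = vec.transform([query])
    scores = (doc_matrix @ q.T).toarray().ravel()
    names = list(api_docs)
    return names[int(np.argmax(scores))]

print(search("reads character files from disk"))  # java.io.FileReader
```

In the actual model, the query and the API are embedded in the same learned vector space, so the same nearest-neighbor lookup serves all three retrieval directions (text-to-code, code-to-text, code-to-code).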

Published Paper

Statistical Learning of API Fully Qualified Names in Code Snippets of Online Forums

Software developers often use online forums such as StackOverflow (SO) to learn how to use software libraries and their APIs. However, the code snippets in such forums often contain undeclared, ambiguous, or largely unqualified external references. Such declaration ambiguity and external-reference ambiguity make it hard for developers to learn to use the APIs correctly. In this paper, we propose ST, a statistical approach to resolve the fully qualified names (FQNs) of the API elements in such code snippets. Unlike existing approaches, which are based on heuristics, ST integrates two factors. First, we learn from a large training code corpus which FQNs often co-occur. Then, to derive the FQN for an API name in a code snippet, we use that knowledge together with the context of neighboring API names. To realize these factors, we treat the problem as statistical machine translation from source code with partially qualified names to source code with the FQNs of the APIs. Our empirical evaluation on real-world code and StackOverflow posts shows that ST achieves very high accuracy, with 97.6% precision and 96.7% recall, 16.5% relatively higher than the state-of-the-art approach.
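The two factors above (co-occurrence statistics plus neighboring-name context) can be sketched with a toy scorer. This is only an illustrative simplification, not the paper's phrase-based translation model; the corpus and API names below are hypothetical.

```python
# Minimal sketch of context-aware FQN resolution: score each candidate
# FQN by its corpus frequency plus how often it co-occurs with the
# other FQNs already resolved in the snippet.
from collections import Counter

# Toy "training corpus": each snippet is the set of FQNs appearing together.
corpus = [
    {"java.util.List", "java.util.ArrayList"},
    {"java.util.List", "java.util.Map"},
    {"java.awt.List", "java.awt.Frame"},
]

def resolve(simple_name, context_fqns):
    # For each snippet containing a candidate FQN for simple_name,
    # add 1 (base frequency) plus the overlap with the known context.
    scores = Counter()
    context = set(context_fqns)
    for snippet in corpus:
        for fqn in snippet:
            if fqn.rsplit(".", 1)[-1] == simple_name:
                scores[fqn] += 1 + len(snippet & context)
    return scores.most_common(1)[0][0]

# "List" is ambiguous; the java.util.Map context disambiguates it.
print(resolve("List", ["java.util.Map"]))  # java.util.List
```

The actual approach generalizes this idea by translating whole token sequences, so the order of neighboring API names also contributes to the resolution.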

Published Paper