How would you develop a model to identify plagiarism?

To plagiarise means:

  • to steal and pass off (the ideas or words of another) as one’s own
  • to use (another’s production) without crediting the source
  • to commit literary theft
  • to present as new and original an idea or product derived from an existing source

In other words, plagiarism is an act of fraud. It involves both stealing someone else’s work and lying about it afterward. Follow the steps below for developing a model that identifies plagiarism:

  • Tokenise the document.

  • Use the NLTK library in Python to remove stopwords from the data.

  • Build an LDA (Latent Dirichlet Allocation) or SDA model of the document, then use the GenSim library to identify the most relevant words, line by line.

  • Use the Google Search API to search for those words; any pages that return close matches are candidate sources to compare the document against.
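
The steps above can be sketched in Python roughly as follows. This is a minimal illustration under several assumptions, not a definitive implementation. The regex tokeniser is a simple stand-in for a proper tokeniser. `FALLBACK_STOPWORDS` is a small hypothetical list used only when NLTK's stopword corpus is not installed. "Google Search API" is assumed to mean the Custom Search JSON API, which needs your own API key and search engine ID. All function names here are illustrative, and only the LDA option is shown.

```python
import json
import re
import urllib.parse
import urllib.request

# Small stand-in list (assumption), used only if NLTK's stopword corpus is unavailable.
FALLBACK_STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "to", "was", "with",
}

# Assumed endpoint: Google's Custom Search JSON API.
SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def tokenise(text):
    """Lowercase the text and split it into word tokens (simple regex tokeniser)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def load_stopwords():
    """Return NLTK's English stopwords, or the fallback set if they are missing."""
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words("english"))
    except (ImportError, LookupError):
        return set(FALLBACK_STOPWORDS)


def content_tokens(text, stop=None):
    """Tokenise the text and drop stopwords."""
    stop = load_stopwords() if stop is None else stop
    return [token for token in tokenise(text) if token not in stop]


def relevant_words_per_line(lines, num_topics=2, topn=3):
    """Fit one LDA model over all lines, then rank each line's own words
    by their weight in that line's dominant topic."""
    from gensim import corpora, models  # imported lazily: only this step needs Gensim

    stop = load_stopwords()
    texts = [content_tokens(line, stop) for line in lines]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(tokens) for tokens in texts]
    lda = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary,
                          passes=10, random_state=0)

    results = []
    for bow in corpus:
        if not bow:  # the line contained only stopwords or punctuation
            results.append([])
            continue
        topics = lda.get_document_topics(bow, minimum_probability=0.0)
        topic_id = max(topics, key=lambda pair: pair[1])[0]
        weight = dict(lda.get_topic_terms(topic_id, topn=len(dictionary)))
        ranked = sorted((word_id for word_id, _ in bow),
                        key=lambda word_id: weight.get(word_id, 0.0), reverse=True)
        results.append([dictionary[word_id] for word_id in ranked[:topn]])
    return results


def build_search_url(words, api_key, engine_id):
    """Build a Custom Search request URL for the given words."""
    params = {"key": api_key, "cx": engine_id, "q": " ".join(words)}
    return SEARCH_ENDPOINT + "?" + urllib.parse.urlencode(params)


def search_links(words, api_key, engine_id):
    """Return the result links for the given words (performs a network request)."""
    with urllib.request.urlopen(build_search_url(words, api_key, engine_id)) as response:
        payload = json.load(response)
    return [item["link"] for item in payload.get("items", [])]
```

To use it, split the document into lines, pass them to `relevant_words_per_line`, and send each non-empty word list to `search_links`. The links that come back are only candidates. Deciding whether a passage was actually copied still needs a comparison of the document against the text of each returned page.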