
Kaggle Competition: Make Data Count

Set up dataclasses for extracting and organising research paper files (`.pdf` and `.xml`) as inputs for model training for the Kaggle competition.
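For reference, a minimal sketch of what such a dataclass could look like, assuming one record per source file; the field names are illustrative rather than the repo's actual schema.

```python
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class PaperFile:
    """One research paper source file (.pdf or .xml) to be parsed for training."""
    article_id: str            # identifier of the paper, e.g. its DOI (assumed field)
    path: Path                 # location of the raw file on disk
    file_type: str = field(init=False)

    def __post_init__(self) -> None:
        # Derive the file type from the extension so downstream parsing can branch on it.
        self.file_type = self.path.suffix.lower()  # ".pdf" or ".xml"
```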

Competition Objectives

The competition entails the following objectives:

  1. Identify data citations in scientific literature (from full text, e.g. PDFs / XML). In other words: find where a paper mentions a dataset or data resource.

  2. Classify each citation into one of two categories (a hypothetical labelled record is sketched after this list):

    • Primary: the data was generated or produced as part of the paper’s own study.

    • Secondary: the data was reused, derived, or taken from elsewhere (an existing dataset).
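For illustration only, here is a hypothetical labelled record that combines both objectives. The field names and identifiers are assumptions for this sketch, not the competition's official submission schema.

```python
# Hypothetical labelled record: a data citation found in a paper (objective 1)
# together with its Primary / Secondary label (objective 2).
citation = {
    "article_id": "10.1234/example-paper",              # the paper being mined
    "dataset_id": "https://doi.org/10.5061/dryad.xyz",  # the data resource it cites
    "type": "Secondary",                                 # "Primary" if the study generated the data itself
}
```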

Kaggle Workspace Repo

My Grief

My model plan was to vectorize the research papers and apply self-organizing maps (SOMs) to the cosine similarity scores. Given that dataset references appear at arbitrary points throughout a paper, I thought it would be easier to segment each paper into some fixed number of chunks, K, for primary / secondary link detection. In other words, the training data for one research paper would hold the shape [1, K, 384] (a minimal encoding sketch follows the list below):

  • 1 = batch size

  • K = number of equal-length segments the paper is split into

  • 384 = hidden dimension of the sentence-transformers encoder. In this case, all-MiniLM-L6-v2 was used, and its hidden size is 384 (the dimensionality of the vector a sentence is encoded / vectorized into, i.e. its representation in latent space).
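The following is a minimal sketch of that encoding step, assuming the paper's full text has already been extracted to a plain string; the chunking strategy and the value of K are illustrative choices, not the final pipeline.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

K = 32  # number of equal segments per paper (illustrative)

def segment_paper(text: str, k: int = K) -> list[str]:
    """Split the full text into k roughly equal character-length segments."""
    step = max(1, len(text) // k)
    return [text[i * step:(i + 1) * step] for i in range(k)]

# all-MiniLM-L6-v2 produces 384-dimensional sentence embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")

def encode_paper(text: str) -> np.ndarray:
    """Encode one paper as a [1, K, 384] array of segment embeddings."""
    segments = segment_paper(text)
    embeddings = model.encode(segments)   # shape: (K, 384)
    return embeddings[np.newaxis, ...]    # shape: (1, K, 384)
```

From there, the cosine similarities between segment embeddings would feed the self-organizing map, which was the intended next step.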

Unfortunately, I ran into server / Kaggle workspace configuration errors, so I wasn't able to submit the complete model, but the data extraction (objective 1) was completed.

Despite the competition being over, I am still eager to keep trying. Hopefully, I will be able to complete this model when I find the time.
