
Kaggle Competition: Make Data Count

Set up dataclasses for extracting and organising research paper files (`.pdf` and `.xml`) as inputs for model training for the Kaggle competition.
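For reference, a minimal sketch of what such a dataclass could look like, assuming one record per source file; the field names are illustrative rather than the repo's actual schema.

```python
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class PaperFile:
    """One research paper source file (.pdf or .xml) to be parsed for training."""
    article_id: str            # identifier of the paper, e.g. its DOI (assumed field)
    path: Path                 # location of the raw file on disk
    file_type: str = field(init=False)

    def __post_init__(self) -> None:
        # Derive the file type from the extension so downstream parsing can branch on it.
        self.file_type = self.path.suffix.lower()  # ".pdf" or ".xml"
```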

Competition Objectives

The competition entails the following objectives:

  1. Identify data citations in scientific literature (from full text, e.g. PDFs / XML). In other words: find where a paper mentions a dataset or data resource.

  2. Classify each citation into one of two categories (a hypothetical labelled record is sketched after this list):

    • Primary: the data was generated or produced as part of the paper’s own study.

    • Secondary: the data was reused, derived, or taken from elsewhere (an existing dataset).
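For illustration only, here is a hypothetical labelled record that combines both objectives. The field names and identifiers are assumptions for this sketch, not the competition's official submission schema.

```python
# Hypothetical labelled record: a data citation found in a paper (objective 1)
# together with its Primary / Secondary label (objective 2).
citation = {
    "article_id": "10.1234/example-paper",              # the paper being mined
    "dataset_id": "https://doi.org/10.5061/dryad.xyz",  # the data resource it cites
    "type": "Secondary",                                 # "Primary" if the study generated the data itself
}
```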

Kaggle Workspace Repo

My Grief

My model plan was to vectorize the research papers and apply self-organizing maps (SOMs) to the cosine similarity scores. Given that dataset references appear at arbitrary points throughout a paper, I thought it would be easier to segment each paper into some fixed number of chunks, K, for primary / secondary link detection. In other words, the training data for one research paper would hold the shape [1, K, 384] (a minimal encoding sketch follows the list below):

  • 1 = batch size

  • K = number of equal-length segments the paper is split into

  • 384 = hidden dimension of the sentence-transformers encoder. In this case, all-MiniLM-L6-v2 was used, and its hidden size is 384 (the dimensionality of the vector a sentence is encoded / vectorized into, i.e. its representation in latent space).
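The following is a minimal sketch of that encoding step, assuming the paper's full text has already been extracted to a plain string; the chunking strategy and the value of K are illustrative choices, not the final pipeline.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

K = 32  # number of equal segments per paper (illustrative)

def segment_paper(text: str, k: int = K) -> list[str]:
    """Split the full text into k roughly equal character-length segments."""
    step = max(1, len(text) // k)
    return [text[i * step:(i + 1) * step] for i in range(k)]

# all-MiniLM-L6-v2 produces 384-dimensional sentence embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")

def encode_paper(text: str) -> np.ndarray:
    """Encode one paper as a [1, K, 384] array of segment embeddings."""
    segments = segment_paper(text)
    embeddings = model.encode(segments)   # shape: (K, 384)
    return embeddings[np.newaxis, ...]    # shape: (1, K, 384)
```

From there, the cosine similarities between segment embeddings would feed the self-organizing map, which was the intended next step.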

Unfortunately, I ran into server / Kaggle workspace configuration errors, so I wasn't able to submit the complete model, but the data extraction (objective 1) was completed.

Despite the competition being over, I am still eager to keep trying. Hopefully, I will be able to complete this model when I find the time.
