Kepler: Part 2 - Data Processing and EDA

This is Part 2 of a series in which we work with data from the Kepler Space Telescope. Here, we learn how to use the Python package Lightkurve to manipulate our data and embark on some exploratory data analysis (EDA).

If you haven’t already, read Kepler: Part 1 - Data Acquisition.

Before we begin writing any code, let’s work out a plan for how we will proceed. We want to

Open the CSV file and extract the TCE data/parameters
Open all FITS files and extract the light curve data

Environment

If you don’t already have Python and Jupyter Notebooks, I suggest using the Anaconda distribution of Python. It includes many useful libraries for scientific computing and allows us to easily create environments for our projects. Specifically, I will be using

Python 3.6.8
conda 4.6.14
macOS 10.14.3

We will now set up our environment.

If you have Anaconda installed

Open Terminal and navigate to your kepler directory from Part 1. Substitute my filepath for yours.

cd /Users/antonio/kepler

Create and activate an environment called kpy

conda create -n kpy python=3.6

conda activate kpy

Install the packages

conda install pandas numpy matplotlib jupyter

conda install --channel conda-forge lightkurve

Launch Jupyter Notebooks

jupyter notebook

Create a new Python 3 notebook

Rename it to ‘kepler_data_processing’ and make sure it is running.

If you do not have Anaconda installed

You will have to emulate the steps above using your preferred method for environment control and package installation, such as pip. Ensure that you install the same packages, as those will all be used below.

pip install pandas numpy matplotlib jupyter lightkurve

Data Processing

Now that our environment is setup, we can begin with the plan we outlined above. We will create a function for every task so that we have a modular system. This also has the benefit of allowing us to quickly test and edit every part.

We begin by importing the libraries we will use.

## Packages to be used
import matplotlib.pyplot as plt
from random import shuffle
import lightkurve as lk
import pandas as pd
import numpy as np
import glob
import time
import sys
import os

def open_files(csv_file):
    '''
    The open_files function opens the csv and extracts the needed data.

    Args:
        csv_file: Should contain desired TCEs and parameters.
          Required Parameter: kepid,
                              av_training_set,
                              tce_plnt_num,
                              tce_period,
                              tce_time0bk.

    Returns:
        tce_data: Panda DataFrame that has the parameters listed above.
    '''

    ## Opening csv as tce_info using pandas
    ## This contains all metadata for all TCEs
    tce_info = pd.read_csv(csv_file)

    ## Removing TCEs with label 'UNK'
    tce_info = tce_info[tce_info.av_training_set != 'UNK']#.reset_index()

    ## Isolate AFP and NTP
    not_pc = tce_info[tce_info['av_training_set'] != 'PC']

    ## Isolate all PCs
    pc_only = all_of_it[tce_info['av_training_set'] == 'PC']

    ## Only keep TCE with tce_plnt_num == 1 and reset the index
    pc_only = pc_only[pc_only.tce_plnt_num == 1].reset_index(drop = True)

    ## Add PCs back to full set
    tce_info = pd.concat([pc_only, not_pc], ignore_index=True, sort=False)

    ## Shuffle the dataframe because all PCs are at the start
    tce_info = tce_info.sample(frac=1).reset_index(drop=True)


    ## Extracting and combining kepids, periods, epochs into one DataFrame
    tce_data = tce_info[['kepid',
                         'av_training_set',
                         'tce_plnt_num',
                         'tce_period',
                         'tce_time0bk']]

    return tce_data

This series is a work in progress. Progress ends here, for now.

Antonio Gordillo Toledo Linkedin

I'm a senior at Cal interested in math, physics, and computation. I'll be using this blog to share my work and projects as I dive deeper into these areas.