Need help with my Python question – I’m studying for my class.
In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting Federal investigation, a significant amount of typically confidential information entered into the public record, including tens of thousands of emails and detailed financial data for top executives. In this project, you will play detective, and put your new skills to use by building a person of interest identifier based on financial and email data made public as a result of the Enron scandal. To assist you in your detective work, we’ve combined this data with a hand-generated list of persons of interest in the fraud case, which means individuals who were indicted, reached a settlement or plea deal with the government, or testified in exchange for prosecution immunity.
You should have Python 2.7 and sklearn running on your computer, as well as the starter code (I will provide the folder), which contains:
poi_id.py : Starter code for the POI identifier, you will write your analysis here.
final_project_dataset.pkl : The dataset for the project, more details below.
tester.py : You don’t need to do anything with this code, but we provide it for transparency and for your reference.
emails_by_address : this directory contains many text files, each of which contains all the messages to or from a particular email address. It is for your reference, if you want to create more advanced features based on the details of the emails dataset. You do not need to process the e-mail corpus in order to complete the project.
Steps to Success
We will provide you with starter code that reads in the data, takes your features of choice, then puts them into a NumPy array, which is the input form that most sklearn functions assume. Your job is to engineer the features, pick and tune an algorithm, and to test and evaluate your identifier.
As preprocessing to this project, we’ve combined the Enron email and financial data into a dictionary, where each key-value pair in the dictionary corresponds to one person. The dictionary key is the person’s name, and the value is another dictionary, which contains the names of all the features and their values for that person. The features in the data fall into three major types, namely financial features, email features and POI labels.
financial features: ['salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock', 'director_fees'] (all units are in US dollars)
email features: ['to_messages', 'email_address', 'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi'] (units are generally number of email messages; the notable exception is 'email_address', which is a text string)
POI label: ['poi'] (boolean, represented as integer)
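To make the nested-dictionary layout concrete, here is a minimal sketch using two made-up records. The names, values, and the `to_rows` helper are hypothetical and not part of the starter code. It also assumes that missing values are stored as the string 'NaN', and it builds plain Python lists rather than a NumPy array so that it runs without extra dependencies.

```python
# Hypothetical records mimicking the layout described above:
# {person name: {feature name: value}}. All values are made up.
data_dict = {
    "DOE JANE": {
        "salary": 250000,
        "bonus": 100000,
        "to_messages": 1200,
        "email_address": "jane.doe@example.com",
        "poi": True,
    },
    "ROE RICHARD": {
        "salary": "NaN",  # assumption: missing values appear as the string "NaN"
        "bonus": 50000,
        "to_messages": 300,
        "email_address": "richard.roe@example.com",
        "poi": False,
    },
}


def to_rows(dataset, feature_names, missing="NaN"):
    """Return (labels, rows): the 'poi' label as 0/1 and the numeric features
    per person, with missing values replaced by 0.0 (an illustrative choice)."""
    labels, rows = [], []
    for name in sorted(dataset):  # sort names for a reproducible order
        person = dataset[name]
        labels.append(int(bool(person["poi"])))
        row = []
        for feature in feature_names:
            value = person.get(feature, missing)
            row.append(0.0 if value == missing else float(value))
        rows.append(row)
    return labels, rows


# Numeric features only; 'email_address' is text and cannot be cast to float.
features_list = ["salary", "bonus"]
labels, rows = to_rows(data_dict, features_list)
print(labels)
print(rows)
```

If NumPy is installed, a list of rows like this can be turned into an array with `numpy.array(rows)`. Replacing missing values with 0.0 is only one option, and it may not be the right choice for every feature.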
You are encouraged to make, transform or rescale new features from the starter features. If you do this, you should store the new feature to my_dataset, and if you use the new feature in the final algorithm, you should also add the feature name to my_feature_list, so your evaluator can access it during testing. For a concrete example of a new feature that you could add to the dataset, refer to the lesson on Feature Selection.
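As one possible illustration of this workflow, the sketch below adds a ratio feature. The feature name `fraction_from_poi`, the `add_fraction` helper, and the sample records are hypothetical; only `my_dataset` and `my_feature_list` are names taken from the project text. It assumes that missing values are stored as the string 'NaN' and that the label 'poi' should come first in the feature list.

```python
import copy

# Hypothetical sample in the project's layout; all values are made up.
data_dict = {
    "DOE JANE": {"poi": True, "to_messages": 1000, "from_poi_to_this_person": 50},
    "ROE RICHARD": {"poi": False, "to_messages": "NaN", "from_poi_to_this_person": "NaN"},
}


def add_fraction(dataset, numerator, denominator, new_name, missing="NaN"):
    """Store numerator/denominator under new_name for each person, in place.
    If either value is missing or the denominator is 0, store 0.0 instead."""
    for person in dataset.values():
        num = person.get(numerator, missing)
        den = person.get(denominator, missing)
        if num == missing or den == missing or den == 0:
            person[new_name] = 0.0
        else:
            person[new_name] = float(num) / float(den)


# Work on a copy so the original dictionary stays untouched.
my_dataset = copy.deepcopy(data_dict)
add_fraction(my_dataset, "from_poi_to_this_person", "to_messages", "fraction_from_poi")

# Register the new feature name so it can be looked up during testing.
# Assumption: the label 'poi' is listed first.
my_feature_list = ["poi", "fraction_from_poi"]
print(my_dataset["DOE JANE"]["fraction_from_poi"])
```

A ratio like this can be more informative than the raw count, because people who receive many emails overall would otherwise look similar to people who receive many emails from POIs specifically.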
In addition, we advise that you keep notes as you work through the project. As part of your project submission, you will compose answers to a series of questions (which I will provide) that help us understand your approach to different aspects of the analysis. Your thought process is, in many ways, more important than your final project, and we will be trying to probe your thought process in these questions.
When making your classifier, you will create three pickle files (..., my_feature_list.pkl). You should also include your modified poi_id.py file in case of any issues with running your code or to verify what is reported in your question responses.
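Writing the pickle files can be sketched with the standard `pickle` module. Only `my_feature_list.pkl` is named in the text above; the file name `my_dataset.pkl`, the `dump_pickle` and `load_pickle` helpers, and the sample data are assumptions for illustration. The starter code may already provide its own way of writing these files, in which case that should be preferred.

```python
import os
import pickle
import tempfile


def dump_pickle(obj, path):
    """Serialize obj to path with pickle (binary mode)."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)


def load_pickle(path):
    """Read back an object written by dump_pickle."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Made-up sample data standing in for the real project objects.
my_dataset = {"DOE JANE": {"poi": True, "salary": 250000}}
my_feature_list = ["poi", "salary"]

out_dir = tempfile.mkdtemp()  # write to a temporary directory for this demo
dataset_path = os.path.join(out_dir, "my_dataset.pkl")  # assumed file name
features_path = os.path.join(out_dir, "my_feature_list.pkl")

dump_pickle(my_dataset, dataset_path)
dump_pickle(my_feature_list, features_path)

# Round-trip check: what is loaded should equal what was saved.
assert load_pickle(dataset_path) == my_dataset
assert load_pickle(features_path) == my_feature_list
```

Reloading each file right after writing it is a cheap way to catch a wrong path or a file opened in text mode before you submit.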
Text File Listing Your References
A list of websites, books, forums, blog posts, GitHub repositories, etc. that you referred to or used in this submission (add N/A if you did not use such resources). Please carefully read the following statement and include it in your document: “I hereby confirm that this submission is my work. I have cited above the origins of any parts of the submission that were taken from websites, books, forums, blog posts, GitHub repositories, etc.”