Tuesday, February 6, 2018

A Note on NSS Data Extraction 1

Extracting Data from NSS 62nd Round
Using STATA:

First we shall extract the NSS data for Consumer Expenditure (62_1.0), which is in ASCII fixed-width format, using STATA (I used STATA 9).

Step 1: Locate the Folder/CD containing data for NSS-Round 62. It should contain three sub-folders, Nss62_1.0 for Consumer expenditure data, Nss62_2.2 for Manufacturing enterprises data, and Nss62_10 for Employment and unemployment data. We are going to use the folder containing data on Consumer expenditure, i.e., Nss62_1.0.

Step 2: The folder Nss62_1.0 should have four items within it, three folders and one text document (README62_1.0). You can find the State Code for the State of interest by following the path:
Nss62_1.0 → Supporting Documents → Instrn. To Field Staff_62 → Appendix-2.

For Uttar Pradesh, the State code is 09 (it was changed from 25, the code used in the NSS 61st Round).
Appendix-2 also shows that Uttar Pradesh is divided into four NSS-Regions, so we will be concerned with the codes 091 (the 1st NSS-Region of Uttar Pradesh) through 094 (the 4th NSS-Region). This will be helpful in the data extraction.

Step 3: Locate the “Layout_62_1.0” file, which is an Excel file, by following the path:
Nss62_1.0 → Supporting Documents → Layout_62_1.0.
We shall call it the Layout file for convenience. The Layout file shows the various “Levels”, each containing “Blocks” of the Schedule that was administered to gather the information from respondents. You will notice that there are seven Levels in all. Now the real job starts.

Step 4: Now we need to extract the data for each Level separately, as their contents differ; for example, the variables of Level 1 do not appear in the other Levels, except for the initial 17 items (up to HHS No.). Thus, we need to write as many dictionary files as there are Levels, which means writing 7 dictionary files to extract the Consumer Expenditure data of Uttar Pradesh.

Following is an example of a dictionary file (for Level 1) for STATA:

dictionary using MH1C01.TXT
{
_column(1) str3 CntrcdRndshft %3s
_column(4) str5 LOTFSU %5s
_column(9) str1 Filler %1s
_column(10) str2 Round %2s
_column(12) str3 Schdlno %3s
_column(15) str1 Sample %1s
_column(16) str1 Sector %1s
_column(17) str3 StateReg %3s
_column(20) str2 District %2s
_column(22) str2 Strtmno %2s
_column(24) str2 Substrtum %2s
_column(26) str1 SubRund %1s
_column(27) str1 Subsmple %1s
_column(28) str4 FODSubReg %4s
_column(32) str1 Segmntno %1s
_column(33) str1 SecndStg %1s
_column(34) str2 HHSno %2s
_column(36) str2 Level %2s
_column(43) str2 SlnoInfor %2s "Serial Number of Informant"
_column(45) str1 Respnscd %1s "Response Code"
_column(46) str1 Survycd %1s
_column(47) str1 Substncd %1s "Substitution Code"
_column(48) str6 DoSurvy %6s
_column(54) str6 DoDespch %6s
_column(60) str3 TCanvs %3s
_column(127) str3 Nss %3s
_column(130) str3 Nsc %3s
_column(133) str10 Mlt %10s
}

Step 5: Once all seven dictionary files have been written, we store them in a folder, say “NSS extract”.

Step 6: Now we need to write a “do-file” for STATA. I’m pasting below the do-file that I’ve written and used to extract the NSS data for all seven Levels at one go.

infile using "E:\NSS extract\Level1.dct", using("E:\Nss62_1.0\Data\MH1C01.TXT")
drop if StateReg < "091" | StateReg > "094"
keep if Level == "01"
destring, replace
save "E:\How to\Level1.dta", replace
clear

infile using "E:\NSS extract\Level2.dct", using("E:\Nss62_1.0\Data\MH1C01.TXT")
drop if StateReg < "091" | StateReg > "094"
keep if Level == "02"
destring, replace
save "E:\How to\Level2.dta", replace
clear

infile using "E:\NSS extract\Level3.dct", using("E:\Nss62_1.0\Data\MH1C01.TXT")
drop if StateReg < "091" | StateReg > "094"
keep if Level == "03"
destring, replace
save "E:\How to\Level3.dta", replace
clear

infile using "E:\NSS extract\Level4.dct", using("E:\Nss62_1.0\Data\MH1C01.TXT")
drop if StateReg < "091" | StateReg > "094"
keep if Level == "04"
destring, replace
save "E:\How to\Level4.dta", replace
clear

infile using "E:\NSS extract\Level5.dct", using("E:\Nss62_1.0\Data\MH1C01.TXT")
drop if StateReg < "091" | StateReg > "094"
keep if Level == "05"
destring, replace
save "E:\How to\Level5.dta", replace
clear

infile using "E:\NSS extract\Level6.dct", using("E:\Nss62_1.0\Data\MH1C01.TXT")
drop if StateReg < "091" | StateReg > "094"
keep if Level == "06"
destring, replace
save "E:\How to\Level6.dta", replace
clear

infile using "E:\NSS extract\Level7.dct", using("E:\Nss62_1.0\Data\MH1C01.TXT")
drop if StateReg < "091" | StateReg > "094"
keep if Level == "07"
destring, replace
save "E:\How to\Level7.dta", replace

Further Explanation regarding the do-file:
“E” is the drive on your PC where you made the folders for the dictionary files and the extracted data. It can just as well be “C” or any other drive.

“NSS extract” is the folder where the dictionary files have been stored. (In the do-file above, the extracted .dta files are saved in a separate folder, “How to”; you can of course save them wherever you prefer.)

“E:\Nss62_1.0\Data\MH1C01.TXT” is the path to the ASCII fixed-width data file. It may differ depending on where you have stored the NSS data to be extracted.

“drop if StateReg < "091" | StateReg > "094"” is given to drop all the irrelevant records. When we import the data, it also contains records for several other States, since they are all stored together in “MH1C01.TXT”; here we need only the data for Uttar Pradesh. Remember that Uttar Pradesh has the NSS code 09 and has been divided into four NSS-Regions, so the data of our interest are those of Regions 1 to 4 of Uttar Pradesh, i.e., the codes 091 to 094. The Layout file has a variable named “State-Region”, which we have called “StateReg” in our dictionary file; this variable helps us identify the relevant records. When we give this command, STATA drops from its memory the data for all States other than Uttar Pradesh. The codes are enclosed in quotation marks because we have extracted all the variables in string format.
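
If you prefer, the same filtering of States can be done by testing only the first two characters (the State part) of StateReg. This is just an alternative sketch, equivalent to the command above:

keep if substr(StateReg, 1, 2) == "09"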

“keep if Level == "0X"” makes STATA further drop the data for all Levels except Level X, where X can be any whole number from 1 to 7, depending on the Level of our choice. (Note that the Level codes in the data are two-character strings, “01” to “07”.)

“destring, replace” converts the string variables to numeric. Remember, we imported all the data in string format.
So, extract data and have fun...!

N.B. I'm obliged to my mentor, who bore the brunt of my experiments with the NSS data when I went to him to discuss my hopelessly frustrating strategies for extracting the data in one go...!
This has been reposted at the request of many users.

39 comments:

  1. Great post!
    Do you know any method to use "if" in the dictionary file so that we can separate the data with just one dictionary?
    I mean, for example, we know that in the ASCII file columns 30 & 31 are the level code; can we write a dictionary that checks these two columns and separates the other columns accordingly?

  2. I have another question: why do you use string format in the dictionary file? Can we use int or byte instead?

  3. Dear Mohammad,
    Thanks for liking the post.
    Regarding the use of 'if' so that we can write a single dictionary file for all levels: well, I've not tried it and, at the outset, it seems difficult, if not impossible. The different levels have only some items in common. Even if we managed it, it may be very prone to errors. You would have noticed that my approach had an emphasis on simplicity. There are so many ways of extracting the data, and I chose the one that seemed easiest to me.
    Regarding the use of 'string' format: yes, you can use any format, but I was not getting very coherent results with the 'float' format or others. Why don't you try it yourself, just replacing 'str' with the 'float' or 'int'/'byte' formats, and see what changes? Then we can discuss it further.

  4. Dear Durgesh
    Ok, I'll try it with float format and will tell you about the results.
    About "if", I've searched a lot and couldn't find any good result. I think your method works best in this case.
    However, I now have a problem with the Layout files of the databases. I have 20 databases, from NSS 38 to NSS 62, but their layout files are in text format and are very badly arranged. Did you find any trick to deal with this problem?

  5. Dear Mohammad,
    I started only with the NSS 62nd round, so I haven't tried the other rounds. I feel there are so many differences that have crept into the various rounds of the NSS in terms of layout, so be very cautious.
    By the way, I would love to know where you are from and where you work.

  6. I am from Iran and am now an MPhil student at Tilburg University, The Netherlands. This work is a research assistantship that I do. You are probably from India, and I am very happy to have found your blog.

  7. Dear Mohammad,
    Yes, I'm from India, doing my post-doctoral research at IGIDR, Mumbai. It is nice to know that work put up on the blog is of use to others. When I started with the NSS data one month back, I was baffled, as I was not able to extract it properly. Then my mentor suggested that, once I learnt the extraction procedure, I write a note about it and post it on the web so that others would not get troubled.
    Working on NSS data for 32 different rounds is very ambitious, and all these rounds are not strictly comparable.

  8. Interesting points on extracting data. I use Python for simple data extraction. Data extraction can be a time-consuming process, but for larger projects involving documents, the web, or files, I tried "extracting data", which worked great; they build quick custom screen scrapers, data extraction, and data parsing programs.

  9. Well your blog has all the intelligent stuff. You have compiled very useful data here with all the details and steps to follow.

  10. Dear Durgesh
    Do you know what the multipliers in the data sets are? I am really confused about them and don't know what their use is.
    Actually, I want to compute the Gini index for the different states. What's your idea about that? Just aggregate the expenditures for each observation and use the Gini index formula?

  11. Dear Mohammad,
    The multiplier should be used to give each household its due weightage, i.e., it tells how many households in the population each household in our data represents. Some people don't use the multipliers, but then the results may be seriously flawed.
    Initially I had not mentioned extracting the multipliers, but I rectified that mistake later. Just re-read the extraction syntax.

    You can refer to Angus Deaton's book, The Analysis of Household Surveys.
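
    As a rough sketch (please verify the exact rule in the documentation released with the 62nd round data), the final household weight is usually derived from the Mlt, Nss and Nsc fields that we extracted in the dictionary file above, along these lines:

    * Sketch only; check the weight rule in the NSS 62nd round documentation.
    * A common convention: weight = Mlt/100 if Nss equals Nsc, else Mlt/200.
    gen weight = Mlt/100
    replace weight = Mlt/200 if Nss != Nsc
    * e.g., a weighted summary (the variable name here is purely illustrative):
    * summarize total_expenditure [aw = weight]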

  12. Dear Durgesh
    Thank you for helping. I have another question about NSS62_1.0: how is Urban/Rural recorded for each observation? Actually, I work with about 15 rounds of data; in some of them the data sets are already divided into urban/rural, but in NSS 62 they are not. Do you know how I can identify them in each state?

  13. Dear Mohammad,
    Rural is coded as 1 and Urban as 2. You should read the layout file.

  14. Dear Durgesh

    Sorry for disturbing you again. I am really confused by the 62nd round dataset. I want to compute the total expenditure of each household across the different levels, but I don't know how I can generate an ID for each household. The layout file is very ambiguous to me! I don't know which variables are required to identify one household. Can you help me?

  15. Dear Durgesh
    I finally found out how to identify each household! Each household can be identified with 4 variables: LOT/FSU number, Segment Number, Second Stage Stratum and HHS No. But finding this with just the layout file was very difficult!
    Just one question: is the rural/urban variable the one named "Sector" in the layout file? The layout file which I have does not give any explanation about that.

  16. Yes, Sector 1 is Rural and 2 is Urban. I am travelling, so I couldn't reply quickly.
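
    A tiny optional sketch, if you want the codes to show up with their names in STATA output after destring (the value-label name here is arbitrary):

    * Label the Sector codes for readability.
    label define sectorlbl 1 "Rural" 2 "Urban"
    label values Sector sectorlbl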

  17. Dear Mohammad,
    I am back to business, so we can discuss the data issues. What exactly are you doing with the data?

  18. Dear Durgesh

    I wanted to compute the Gini index for the 52nd to 62nd rounds for each state and sector, and it is almost done. My problem was to identify each household so as to avoid duplicates in the calculations. For each year the layout files and the names of the variables differ from the other years, and that made the work difficult.
    Anyway, the Gini calculation is now finished, and I hope there wasn't any mistake. I can send my results to you. If I have any other question, I will ask you.

    Thank you very much for your help.

  19. Dear Mohammad,
    You're always welcome.
    I, too, am learning to deal with the NSS data, as I told you in the very beginning. So we can both learn from each other.
    For identifying a household, as you correctly noticed, we need four things: the LOT/FSU number, Segment number, Second Stage Stratum, and Household serial number (a small sketch is given below).
    Yes, NSS data files have different layouts for each round...they, too, are learning...:)

    You can directly mail me at pathakdc@gmail.com or durgesh@igidr.ac.in
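
    A small sketch of building such an identifier in STATA (variable names as in the Level 1 dictionary given in the post above; it is best done while the variables are still in string form, or with a separator, so that dropped leading zeros cannot create duplicate IDs):

    * Sketch: build a household identifier from the four fields named above.
    * punct("-") keeps the four parts separated and unambiguous.
    egen hhid = concat(LOTFSU Segmntno SecndStg HHSno), punct("-")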

  20. It would be great if you could kindly tell me how to first convert the .txt files, that is, the actual raw data files of the NSS, into a ".dat" file?

  21. I think you need to extract the data following the procedure given in the post; the data will then be saved as a .dta file.
    I can help more if you are specific about your question.

  22. Hi Durgesh,
    This is a wonderful post. I am looking for the variable descriptions for the NSS 43rd round, i.e., how household type, relation to head and occupation are coded. Would you be able to help me locate these in soft form, if possible? Your help is highly appreciated. Thanks, KS

  23. Hi KS,
    Going back, I've not gone beyond the 50th round. Please check the layout file; the variables of your interest should be described there. If you cannot find them, then email me the layout file and I may be able to help.
    Thanks for writing...

  24. Hi Durgesh,
    Thanks for your reply. I was wondering if you could help me with the lighting code and cooking code in NSS 50. I have the numbers but do not have information on how they are defined. Your help will be highly appreciated. Thanks KS.

  25. Dear KS (?),
    Here are the desired codes:

    Item 21: primary source of energy for cooking: coke, coal-1, firewood and chips-2, gas (coal, oil or LPG)-3, gobar gas-4, dung cake-5, charcoal-6, kerosene-7, others-8, no cooking arrangement-9.

    Item 22: primary source of energy for lighting: kerosene-1, other oil-2, gas-3, candle-4, electricity-5, others-8, no lighting arrangement-9.

    This is from the Schedule 50_1.0. If you have the schedule, read it.

    Hope this helps... and don't make an acronym of your name!

  26. This comment has been removed by the author.

  27. Dear KS,
    When were you at IGIDR, and what course did you take? You could also search for these things online, even my email ID. If you don't find them after trying, write to me.
    Durgesh

  28. :) I took all the courses which were offered there. As for searching over the internet I had a look before writing to you. Thanks for your retort though!!

  29. Dear KS!
    I was just curious to know whether you did your M.Sc or M.Phil at IGIDR. Where are you working now?
    Better to email me at pathakdc@gmail.com so that I can send you the Schedule for the 50th round.
    Durgesh

  30. Hi,

    This is great stuff. I am a doctoral student at IIM Bangalore working on the 64th round data, and your blog was very helpful. I must have saved weeks trying to figure it out. I think I will blog similarly once I am done with my project.

    A small note. My stata command for infiling the dct file looked something like this:
    infile using "dct files\level1_ah6.dct" if level==1

    compress

    save "stata files\level1_ah6.dta", replace

    Note the small shortcut avoiding a separate line for 'level' and the compress command that greatly reduces file size.

    cheers,
    Chinmay

  31. Dear Chinmay,
    Thanks for appreciating the effort.

    Actually, there are several ways to read data into Stata and one keeps on learning by doing. You suggested a good shortcut, thanks. Yes, compressing the data will reduce memory requirement.

    Keep in touch,
    Regards,
    Durgesh

  32. Dear Durgesh,

    I am working with the 64th Round dataset and am looking to study inter-state migration in India and how it is affected by wages and distance.

    Could you please help me with how to create this '.dct' file? Which software is needed? Will it contain all the responses from all questions asked in the survey?

    I have all the raw data but do not know how to import it into stata.

    Please help, I am only an undergrad student and very stuck.

    Thanks,
    Sanjay

  33. Hi Sir

    I really like your post on NSS data extraction. I am working on the 55th and 61st rounds of consumer expenditure data and would like to know the use of weights and how I should use them.

    Thanks and Regards
    Geeta

  34. Dear Sanjay,
    Please read the post carefully. I've given the structure of a dictionary file.
    Durgesh

  35. Hi Durgesh,
    Thanks for the post. This was helpful.
    I am currently working on the Round 66 data and am unable to match MPCE-URP, i.e., the monthly per capita expenditure from the Level 10 data, with the report provided by the NSSO team.
    - I have converted the ASCII data to SPSS.
    - Applied the weights as per the instructions.
    - Run the aggregate for MPCE – Rural & Urban.
    - I get figures which are far from the NSSO report averages.
    Is there any calculation I am missing?
    Thanks!

  36. Hi Durgesh, I am working on the NSS 64th round. I am a master's student working on a paper where I want to see the impact of income inequality in society on gender bias in education in India. So my dependent variable is a latent variable taking the value 1 if a child aged 6-10 is enrolled in a primary school and 0 otherwise. Now I want to generate the Gini so that I can regress the dependent variable on it, but I am unable to do that. I have found commands that calculate the Gini in Stata, but I need to generate the Gini to run my regression. Can you help me with this?
